This is an archive of the discontinued LLVM Phabricator instance.

Differential D149903

[VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.
ClosedPublic

Authored by fhahn on May 4 2023, 2:07 PM.

Details

Reviewers

Ayal
gilr
rengolin

Commits

rG70535f5e609f: [VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.

Summary

This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 3 benefits:

1. The VPlan-based version is simpler; we don't need to implement special codegen for each supported instruction type like the IR based one.
2. It removes a dependency on the cost-model after VPlan execution.
3. It removes a use of getVPValue that uses underlying values after VPlan execution (see removed FIXME).
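
As a rough illustration of why shrinking is legal for operations such as add and mul, the following self-contained sketch (plain C++, not LLVM code; all helper names are hypothetical) compares a 32-bit computation whose result is truncated to 8 bits with the same computation carried out on truncated operands. It assumes the low bits of the result depend only on the low bits of the operands, which holds for add and mul but not for every opcode; deciding when that is the case is left to the analysis that produces MinBWs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the original code: compute (a + b) * c in 32 bits,
// then keep only the low 8 bits (a trunc of the wide result).
inline std::uint8_t wideThenTrunc(std::uint32_t a, std::uint32_t b,
                                  std::uint32_t c) {
  return static_cast<std::uint8_t>((a + b) * c);
}

// Hypothetical model of the shrunk code: truncate the operands first and
// reduce every intermediate result to 8 bits.
inline std::uint8_t truncThenNarrow(std::uint32_t a, std::uint32_t b,
                                    std::uint32_t c) {
  const auto na = static_cast<std::uint8_t>(a);
  const auto nb = static_cast<std::uint8_t>(b);
  const auto nc = static_cast<std::uint8_t>(c);
  // Operands are promoted to int for the arithmetic; the cast wraps the
  // result back to 8 bits.
  const auto sum = static_cast<std::uint8_t>(na + nb);
  return static_cast<std::uint8_t>(sum * nc);
}

// Small usage example: both orders give the same low 8 bits.
inline void checkLowBitsAgree() {
  assert(wideThenTrunc(200, 100, 3) == truncThenNarrow(200, 100, 3));
}
```

The narrow version never touches the upper 24 bits, which is what lets the vectorizer use narrower vector element types.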

Depends on D149081.

Depends on D149079.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ayal added inline comments. Oct 4 2023, 4:03 PM

llvm/lib/Transforms/Vectorize/VPlan.h
281	How/Is this removal related?
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
974	(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.
975	nit: use `LiveInInst` or something similar rather than `UI`?
981
988	Set once before the loop for all live-ins to be truncated.
997	Any order other than depth first would also do, right?
1008	(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction. Or migrate the analysis that builds MinBWs to run on VPlan.
1009	nit: lookup.
1014	Would be good to comment how memory and replicate cases are (not) processed.
1020	Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?
1030	This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other? OTOH, better handle Trunc here as well? Is it handled well below?
1034	`// SExt/Zext is redundant - stick with its operand.` ?
1041	Place assert earlier?
1043–1044
1055	This means the size of all operands is equal to NewResSizeInBits, can this be?
1059–1060	nit: keep consistent with above.
1068–1070	nit: keep consistent with above.
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71	nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to `InstCombine`. Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to truncateToMinimalBitwidths(), rather than pass MinBWs to optimize()?

Address latest comments, apologies for the delay!

llvm/lib/Transforms/Vectorize/VPlan.h
281	The last user of this function has been removed in the patch.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
964–965	Code has been moved to D159202
967–969	code has been moved to D159202
973	Wrapped and added comment, thanks!
975	Renamed, thanks!
976	Updated, thanks!
981	Adjusted, thanks!
982	Turned into assert, thanks!
988	hoisted, thanks!
997	Yes, I think the order doesn't matter here.
1009	Done, thanks!
1014	Added a comment, thanks!
1020	Turned `isIntegerTy` into assert but retained size check as there are entries where the sizes are the same (e.g. for `truncs`).
1030	Thanks, changed to `if`. I don't think Trunc is handled explicitly in the latest version.
1034	this check has been moved up and is not needed any longer.
1041	moved up, thanks!
1043–1044	adjusted, thanks!
1055	There are cases where a ZExt narrowed earlier is used as operand here, so the type is already adjusted.
1059–1060	Adjusted, thanks!
1068–1070	reordered, thanks!

Harbormaster completed remote builds in B257842: Diff 557740. Oct 17 2023, 1:22 PM

Various comments, also trying to reason about how this patch changes tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439	Retain a comment explaining why replicate recipes are not truncated?
3478	Retain this comment regarding dropping wrapping flags?
3493	A Trunc is handled by shrinking its operand.
3518	(If nothing is done to the operands, what is the result extended to?)
llvm/lib/Transforms/Vectorize/VPlan.h
281	Very well!
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755	Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type. Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited. Very well.
780	Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction? Might be a potential follow-up, but we would still potentially update MinBWs on each recipe replacement? Sure, like updating any other property of a recipe when replaced.
796	Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:
A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values continue to feed and use values of the original type. A pair of such instructions where one feeds the other will have a zext-trunc pair added between them, which will later be folded.
B. Instructions that convert between two distinct types continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by a subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?
C. Loads are ignored - excluded from MinBWs? They could potentially be narrowed to load only the required bits, though it's unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)
D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)
973	Suffice to ask `if (!NewResSizeInBits)`?
974	Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.
977	assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.
993	Can skip phi's, none are included in MinBWs.
994	Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.
1007	Suffice to ask `if (!NewResSizeInBits)`?
1008	Thoughts about the above?
1016	Should replicate recipes be handled next to handling widen memory recipes above?
1020	nit: `ResTy` >> `OldResTy`, `ResSizeInBits` >> `OldResSizeInBits`
1023	`assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");` here instead of below?
1029	nit: `VPC` >> `OldExt`, `Opc` >> `OldOpc`?
1030	Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and has its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?
1033	Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.
1034	?
1038	nit: `C` >> `NewCast`? If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?
1055	Maybe worth a comment.
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71	Thoughts on the above? Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform. Does "as hints to InstCombine" below still hold?
llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
41–42	hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.
68	Spotted and removed duplicate zext of WIDE_LOAD8.
159	This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.
167	BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?
168	Both insertelement's now use poison.
176	BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302	We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>, but fail to fold a similar one following the add-2's?
330	We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?
474–475	Hmm, before we narrowed these two shufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?
487	Many zext-trunc pairs left to collect.
513	Above trunc of TMP2 is redundant along with its zext in the ph.
520	Above trunc of TMP4 is redundant along with its zext in the ph.
llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334	We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?

fhahn mentioned this in rG0c8e5be6fa08: [VPlan] Simplify redundant trunc (zext A) pairs to A. Oct 22 2023, 3:42 AM

fhahn mentioned this in rG6f3b88baa2ac: [VPlan] Move trunc ([s|z]ext A) simplifications to simplifyRecipe. Nov 16 2023, 1:17 PM
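
The fold named in these two commits can be sketched on a toy expression node. This is an illustration only: `Node`, `Op` and `simplifyCast` are hypothetical, and the sketch covers just the case where the truncated width equals the width of A. The actual VPlan simplification is not reproduced here.

```cpp
#include <cassert>

// Hypothetical, minimal expression node; not a VPlan recipe.
enum class Op { Leaf, ZExt, SExt, Trunc };

struct Node {
  Op op;
  unsigned bits;        // width of the value this node produces
  const Node *operand;  // nullptr for leaves
};

// Returns the node that can stand in for `n`.
// trunc(zext A) or trunc(sext A) is replaced by A when the trunc produces
// exactly A's width; every other node is returned unchanged.
inline const Node *simplifyCast(const Node *n) {
  if (n->op != Op::Trunc)
    return n;
  const Node *ext = n->operand;
  assert(ext && "a cast must have an operand");
  if (ext->op != Op::ZExt && ext->op != Op::SExt)
    return n;
  const Node *a = ext->operand;
  assert(a && "a cast must have an operand");
  return a->bits == n->bits ? a : n;
}
```

For example, with A of width 16, a zext to 32 followed by a trunc to 16 simplifies to A, while a trunc to 8 is left alone by this sketch.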

Address comments and major simplification after moving cast folding to simplifyRecipes.

Hopefully all comments are addressed; I hope I didn't miss any.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3439	Retained when skipping VPReplicateRecipe.
3478	Done, thanks!
3518	It stays the same, there's no extend in that case.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
755	This has been updated to now use VPTypeAnalysis.
796	The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.
973	This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.
974	LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.
977	Turned into an assert, thanks!
993	There's an early continue now that skips phis and other unsupported recipes.
994	Nope, looks like this is not needed in the latest version.
1007	Simplified, thanks!
1008	I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.
1016	We still need to count them for verification
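
Regarding the retained comment about wrapping flags (LoopVectorize.cpp 3478): a small sketch, in plain C++ with hypothetical names, of how an add that does not wrap in 32 bits can wrap once it is performed in 8 bits. The low bits still match the wide result, but a no-unsigned-wrap flag copied from the wide add would no longer hold for the narrow one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for the sketch.
struct NarrowAdd {
  std::uint8_t value;  // low 8 bits of the sum
  bool wrapped;        // true if the 8-bit unsigned add overflowed
};

// Performs the add on the operands truncated to 8 bits and reports whether
// the narrow add wrapped.
inline NarrowAdd addInEightBits(std::uint32_t a, std::uint32_t b) {
  const auto na = static_cast<std::uint8_t>(a);
  const auto nb = static_cast<std::uint8_t>(b);
  const unsigned sum = static_cast<unsigned>(na) + static_cast<unsigned>(nb);
  return {static_cast<std::uint8_t>(sum), sum > 0xFFu};
}

// Usage example: 200 + 100 = 300 fits in 32 bits, but not in 8 bits.
inline void checkNarrowAddWraps() {
  const NarrowAdd r = addInEightBits(200, 100);
  assert(r.wrapped);
  assert(r.value == static_cast<std::uint8_t>(200u + 100u));
}
```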

fhahn added inline comments. Nov 16 2023, 2:15 PM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1020	Renamed, thanks!
1023	Done, and also removed continue
1029	This code is now gone, handled by recipe simplification.
1033	Code is gone now
1034	Code now gone.
1038	Code gone now.
llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
159	still more work to do :) Arguably the invariant instructions are artificial, in the regular pipeline, no invariant instructions should remain.
167	The latest version avoids truncating the same value twice.
168	I think the use of undef is a leftover that wasn't updated; it should be poison.
176	The latest version avoids truncating the same value twice.
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
302	folding now happens all in simplifyRecipes, should handle this now
474–475	I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.
487	Should be better cleaned up now
llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334	trunc/ext pairs should be better cleaned up in the latest version

Harbormaster completed remote builds in B258087: Diff 558116. Nov 16 2023, 6:49 PM

Looks much simpler! Minor last nits.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
962	nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?
965–967	nit
971–974	?
997	But a (more) expensive RPOT order is needed, to handle defs before uses?
1021	Is it possible for MinBWs not to contain Op's live-in IR value in this case?

Address latest comments, thanks!

fhahn added inline comments. Nov 23 2023, 4:10 AM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
962	Updated, thanks!
965–967	Done thanks! This also limits the scope of TypeInfo to the range where it is valid. After `truncateToMinimalBitwidths`, we would need to invalidate the info for the modified recipes otherwise. This can be done in the future.
971–974	Simplified, thanks!
997	The latest version should not need RPO, as the bit width of the results do not change for any user (previously they might due to early cast simplifications). Changed to depth first.
1021	Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

Harbormaster completed remote builds in B258119: Diff 558159. Nov 23 2023, 4:59 AM

ping :)

Ayal added inline comments. Nov 29 2023, 9:23 AM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
780	Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.
965–967	Very well. Worth commenting that `TypeInfo` should not be used following truncateToMinimalBitwidths.
968	nit: `ProcessedRecipesNum`?
968	`ProcessedTruncs` is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"? Worth explaining why processed truncs are recorded.
971	Should `PH` be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...
974	Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?
990	Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.
995	Does `OldResSizeInBits` equal the size of `OldResTy`, for the non-cast Widen or Select `R`?
1012	`Ins`? Perhaps `ProcessedTrunc`?
1013	Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing? Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?
1016	nit: place simpler if !isLiveIn case first?
1018–1021	nit
1026	Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71	WDYT on the above: should the caller of optimize() precede its call with a direct call to truncateToMinimalBitwidths(), rather than pass MinBWs to optimize()?

Rebase and address latest comments, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
965–967	Sunk further into truncateToMinimalBitwidths
968	Changed to `NumProcessedRecipes`
968	It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated). Moved out of ifdef.
971	Good point, there should be nothing to shrink in PH for now, as the analysis is for the loop body only, adjusted!
974	Yep
990	Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail
995	Yes, I forgot to remove this use of IR `getType`. Updated to use `TypeInfo.inferScalarType(ResultVPV)` and then `getScalarSizeInBits` of the returned type.
1012	Updated, thanks!
1013	Early continue would mean duplicating the code to update the operands, I left things for now as is, including using `insert`. `insert` means we only need to lookup the insert-pos once, vs 2 lookups with separate `lookup` and then `[]`. WDYT?
1016	Done, thanks!
1026	Yep, for now it is simpler and results in a smaller test diff to do it directly there as it is not only LICM but also very simple CSE
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
71	Sounds good, updated, thanks!

Ayal added inline comments. Nov 29 2023, 2:23 PM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911	Should be the same `Ctx` passed in as parameter?
963–965	nit: redundant move of empty line?
968	Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated). Very well, may deserve a comment.
990	I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.
995	Ah, ok, wondered if using the size of the type of `UI` directly would be simpler?
1013	OK, WDYT of something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83

Ayal added inline comments. Nov 30 2023, 12:05 AM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

1013

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }
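
The insert-then-fill pattern in the suggestion above can be tried outside LLVM with a standard container. A minimal sketch, assuming std::unordered_map in place of the map used for ProcessedTruncs; `Trunc` and `TruncCache` are hypothetical stand-ins, and `.first`/`.second` are used instead of structured bindings.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a truncate recipe.
struct Trunc {
  std::string operand;
  unsigned bits;
};

// Hypothetical cache that hands out one truncate per operand.
class TruncCache {
public:
  // Returns the truncate for `operand`, creating it on the first request.
  Trunc *getOrCreate(const std::string &operand, unsigned bits) {
    // One insert both looks the key up and reserves an empty slot for it,
    // so no second lookup is needed to store the new truncate.
    auto result =
        cache.insert(std::make_pair(operand, std::unique_ptr<Trunc>()));
    if (result.second) {
      result.first->second.reset(new Trunc{operand, bits});
      ++created;
    }
    assert(result.first->second->bits == bits &&
           "all users of an operand must agree on the narrow width");
    return result.first->second.get();
  }

  unsigned created = 0;  // number of truncates actually built

private:
  std::unordered_map<std::string, std::unique_ptr<Trunc>> cache;
};
```

Requesting the same operand twice returns the same object and bumps `created` only once, which mirrors re-using a previously generated truncate.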

Addressed latest comments, thanks!

Harbormaster completed remote builds in B258145: Diff 558195. Nov 30 2023, 4:04 AM

fhahn added inline comments. Nov 30 2023, 5:14 AM

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
911	Yes, fixed!
963–965	changed back, thanks!
968	Added a comment to ProcessedTruncs definition.
990	They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to "Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification."
995	It might be slightly simpler, but would mean this may lead to a crash further down the line, once we support recipes without underlying values/instructions (and we forget to update this line) and/or if some other transform adjusted the type. Left as is for now
llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83	Fixed, thanks!

This looks good to me, thanks for accommodating!
Adding a minor redundancy spotted plus some test related notes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949	redundant - hoist above the early-continue.
llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167	Duplicated TMP0 and TMP1 still here?
176	Still seeing duplicate TMP2 and TMP3?
194–196	Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.
249–250	ditto.
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
30–31	Fold zext-trunc pair, several such cases follow.
302	The one following the add-2's is also folded now.
330	Other pair also folded now.
474–475	Worth testing with a subsequent instCombine, to ensure pessimization is avoided?
487	Indeed looks like it!
llvm/test/Transforms/LoopVectorize/trunc-shifts.ll
334	Indeed!

This revision is now accepted and ready to land. Nov 30 2023, 5:22 AM

Closed by commit rG70535f5e609f: [VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version. (authored by fhahn). Dec 2 2023, 8:13 AM

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rG70535f5e609f: [VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.

fhahn marked 2 inline comments as done. Dec 2 2023, 8:15 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
949	Fixed in the committed version, thanks!
llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
167	They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.

This triggers failed asserts, see https://github.com/llvm/llvm-project/issues/74231.

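Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	151 lines
llvm/lib/Transforms/Vectorize/VPlan.h	4 lines
llvm/lib/Transforms/Vectorize/VPlanTransforms.h	7 lines
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp	90 lines
llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll	139 lines
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll	388 lines
llvm/test/Transforms/LoopVectorize/AArch64/type-shrinkage-insertelt.ll	76 lines
llvm/test/Transforms/LoopVectorize/scalable-trunc-min-bitwidth.ll	10 lines
llvm/test/Transforms/LoopVectorize/trunc-shifts.ll	14 lines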

Diff 558186

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

...
 protected:
   /// Create code for the loop exit value of the reduction.
   void fixReduction(VPReductionPHIRecipe *Phi, VPTransformState &State);

   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
   void sinkScalarOperands(Instruction *PredInst);

-  /// Shrinks vector element sizes to the smallest bitwidth they can be legally
-  /// represented as.
-  void truncateToMinimalBitwidths(VPTransformState &State);

   /// Returns (and creates if needed) the trip count of the widened loop.
   Value *getOrCreateVectorTripCount(BasicBlock *InsertBlock);

   /// Returns a bitcasted value to the requested vector type.
   /// Also handles bitcasts of vector<float> <-> vector<pointer> types.
   Value *createBitOrPointerCast(Value *V, VectorType *DstVTy,
                                 const DataLayout &DL);

...
 }

 static Type *largestIntegerVectorType(Type *T1, Type *T2) {
   auto *I1 = cast<IntegerType>(cast<VectorType>(T1)->getElementType());
   auto *I2 = cast<IntegerType>(cast<VectorType>(T2)->getElementType());
   return I1->getBitWidth() > I2->getBitWidth() ? T1 : T2;
 }

-void InnerLoopVectorizer::truncateToMinimalBitwidths(VPTransformState &State) {
-  // For every instruction `I` in MinBWs, truncate the operands, create a
-  // truncated version of `I` and reextend its result. InstCombine runs
-  // later and will remove any ext/trunc pairs.
-  SmallPtrSet<Value *, 4> Erased;
-  for (const auto &KV : Cost->getMinimalBitwidths()) {
-    // If the value wasn't vectorized, we must maintain the original scalar
-    // type. The absence of the value from State indicates that it
-    // wasn't vectorized.
-    // FIXME: Should not rely on getVPValue at this point.
-    VPValue *Def = State.Plan->getVPValue(KV.first, true);
-    if (!State.hasAnyVectorValue(Def))
-      continue;
-    // If the instruction is defined outside the loop, only update the first
-    // part; the first part will be re-used for all other parts.
-    unsigned UFToUse = OrigLoop->contains(KV.first) ? UF : 1;
-    for (unsigned Part = 0; Part < UFToUse; ++Part) {
-      Value *I = State.get(Def, Part);
-      if (Erased.count(I) || I->use_empty() || !isa<Instruction>(I))
-        continue;
-      Type *OriginalTy = I->getType();
-      Type *ScalarTruncatedTy =
-          IntegerType::get(OriginalTy->getContext(), KV.second);
-      auto *TruncatedTy = VectorType::get(
-          ScalarTruncatedTy, cast<VectorType>(OriginalTy)->getElementCount());
-      if (TruncatedTy == OriginalTy)
-        continue;
-
-      IRBuilder<> B(cast<Instruction>(I));
-      auto ShrinkOperand = [&](Value *V) -> Value * {
-        if (auto *ZI = dyn_cast<ZExtInst>(V))
-          if (ZI->getSrcTy() == TruncatedTy)
-            return ZI->getOperand(0);
-        return B.CreateZExtOrTrunc(V, TruncatedTy);
-      };

-      // The actual instruction modification depends on the instruction type,
-      // unfortunately.
-      Value *NewI = nullptr;
-      if (auto *BO = dyn_cast<BinaryOperator>(I)) {
-        Value *Op0 = ShrinkOperand(BO->getOperand(0));
-        Value *Op1 = ShrinkOperand(BO->getOperand(1));
-        NewI = B.CreateBinOp(BO->getOpcode(), Op0, Op1);
-
-        // Any wrapping introduced by shrinking this operation shouldn't be
-        // considered undefined behavior. So, we can't unconditionally copy
-        // arithmetic wrapping flags to NewI.
-        cast<BinaryOperator>(NewI)->copyIRFlags(I, /*IncludeWrapFlags=*/false);
-      } else if (auto *CI = dyn_cast<ICmpInst>(I)) {
-        Value *Op0 = ShrinkOperand(CI->getOperand(0));
-        Value *Op1 = ShrinkOperand(CI->getOperand(1));
-        NewI = B.CreateICmp(CI->getPredicate(), Op0, Op1);
-      } else if (auto *SI = dyn_cast<SelectInst>(I)) {
-        Value *TV = ShrinkOperand(SI->getTrueValue());
-        Value *FV = ShrinkOperand(SI->getFalseValue());
-        NewI = B.CreateSelect(SI->getCondition(), TV, FV);
-      } else if (auto *CI = dyn_cast<CastInst>(I)) {
-        switch (CI->getOpcode()) {
-        default:
-          llvm_unreachable("Unhandled cast!");
-        case Instruction::Trunc:
-          NewI = ShrinkOperand(CI->getOperand(0));
-          break;
-        case Instruction::SExt:
-          NewI = B.CreateSExtOrTrunc(
-              CI->getOperand(0),
-              smallestIntegerVectorType(OriginalTy, TruncatedTy));
-          break;
-        case Instruction::ZExt:
-          NewI = B.CreateZExtOrTrunc(
-              CI->getOperand(0),
-              smallestIntegerVectorType(OriginalTy, TruncatedTy));
-          break;
-        }
-      } else if (auto *SI = dyn_cast<ShuffleVectorInst>(I)) {
-        auto Elements0 =
-            cast<VectorType>(SI->getOperand(0)->getType())->getElementCount();
-        auto *O0 = B.CreateZExtOrTrunc(
-            SI->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements0));
-        auto Elements1 =
-            cast<VectorType>(SI->getOperand(1)->getType())->getElementCount();
-        auto *O1 = B.CreateZExtOrTrunc(
-            SI->getOperand(1), VectorType::get(ScalarTruncatedTy, Elements1));
-
-        NewI = B.CreateShuffleVector(O0, O1, SI->getShuffleMask());
-      } else if (isa<LoadInst>(I) || isa<PHINode>(I)) {
-        // Don't do anything with the operands, just extend the result.
-        continue;
-      } else if (auto *IE = dyn_cast<InsertElementInst>(I)) {
-        auto Elements =
-            cast<VectorType>(IE->getOperand(0)->getType())->getElementCount();
-        auto *O0 = B.CreateZExtOrTrunc(
-            IE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements));
-        auto *O1 = B.CreateZExtOrTrunc(IE->getOperand(1), ScalarTruncatedTy);
-        NewI = B.CreateInsertElement(O0, O1, IE->getOperand(2));
-      } else if (auto *EE = dyn_cast<ExtractElementInst>(I)) {
-        auto Elements =
-            cast<VectorType>(EE->getOperand(0)->getType())->getElementCount();
-        auto *O0 = B.CreateZExtOrTrunc(
-            EE->getOperand(0), VectorType::get(ScalarTruncatedTy, Elements));
-        NewI = B.CreateExtractElement(O0, EE->getOperand(2));
-      } else {
-        // If we don't know what to do, be conservative and don't do anything.
-        continue;
-      }
-
-      // Lastly, extend the result.
-      NewI->takeName(cast<Instruction>(I));
-      Value *Res = B.CreateZExtOrTrunc(NewI, OriginalTy);
-      I->replaceAllUsesWith(Res);
-      cast<Instruction>(I)->eraseFromParent();
-      Erased.insert(I);
-      State.reset(Def, Res, Part);
-    }
-  }

-  // We'll have created a bunch of ZExts that are now parentless. Clean up.
-  for (const auto &KV : Cost->getMinimalBitwidths()) {
-    // If the value wasn't vectorized, we must maintain the original scalar
-    // type. The absence of the value from State indicates that it
-    // wasn't vectorized.
-    // FIXME: Should not rely on getVPValue at this point.
-    VPValue *Def = State.Plan->getVPValue(KV.first, true);
-    if (!State.hasAnyVectorValue(Def))
-      continue;
-    unsigned UFToUse = OrigLoop->contains(KV.first) ? UF : 1;
-    for (unsigned Part = 0; Part < UFToUse; ++Part) {
-      Value *I = State.get(Def, Part);
-      ZExtInst *Inst = dyn_cast<ZExtInst>(I);
-      if (Inst && Inst->use_empty()) {
-        Value *NewI = Inst->getOperand(0);
-        Inst->eraseFromParent();
-        State.reset(Def, NewI, Part);
-      }
-    }
-  }
-}

void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,		void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
VPlan &Plan) {		VPlan &Plan) {
// Insert truncates and extends for any truncated instructions as hints to
// InstCombine.
if (VF.isVector())
truncateToMinimalBitwidths(State);

// Fix widened non-induction PHIs by setting up the PHI operands.		// Fix widened non-induction PHIs by setting up the PHI operands.
if (EnableVPlanNativePath)		if (EnableVPlanNativePath)
fixNonInductionPHIs(Plan, State);		fixNonInductionPHIs(Plan, State);

// At this point every instruction in the original loop is widened to a		// At this point every instruction in the original loop is widened to a
// vector form. Now we need to fix the recurrences in the loop. These PHI		// vector form. Now we need to fix the recurrences in the loop. These PHI
// nodes are currently empty because we did not want to introduce cycles.		// nodes are currently empty because we did not want to introduce cycles.
// This is the second stage of vectorizing recurrences.		// This is the second stage of vectorizing recurrences.
▲ Show 20 Lines • Show All 5,151 Lines • ▼ Show 20 Lines	void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
ElementCount MaxVF) {		ElementCount MaxVF) {
assert(OrigLoop->isInnermost() && "Inner loop expected.");		assert(OrigLoop->isInnermost() && "Inner loop expected.");

auto MaxVFTimes2 = MaxVF * 2;		auto MaxVFTimes2 = MaxVF * 2;
for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {		for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {
VFRange SubRange = {VF, MaxVFTimes2};		VFRange SubRange = {VF, MaxVFTimes2};
if (auto Plan = tryToBuildVPlanWithVPRecipes(SubRange)) {		if (auto Plan = tryToBuildVPlanWithVPRecipes(SubRange)) {
// Now optimize the initial VPlan.		// Now optimize the initial VPlan.
		if (!Plan->hasVF(ElementCount::getFixed(1)))
		VPlanTransforms::truncateToMinimalBitwidths(
		*Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
VPlanTransforms::optimize(Plan, PSE.getSE());		VPlanTransforms::optimize(Plan, PSE.getSE());

assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");		assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));		VPlans.push_back(std::move(Plan));
}		}
VF = SubRange.End;		VF = SubRange.End;
}		}
}		}

// Add the necessary canonical IV and branch recipes required to control the		// Add the necessary canonical IV and branch recipes required to control the
▲ Show 20 Lines • Show All 1,805 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/VPlan.h

Show First 20 Lines • Show All 269 Lines • ▼ Show 20 Lines	struct VPTransformState {
Value *get(VPValue *Def, const VPIteration &Instance);		Value *get(VPValue *Def, const VPIteration &Instance);

bool hasVectorValue(VPValue *Def, unsigned Part) {		bool hasVectorValue(VPValue *Def, unsigned Part) {
auto I = Data.PerPartOutput.find(Def);		auto I = Data.PerPartOutput.find(Def);
return I != Data.PerPartOutput.end() && Part < I->second.size() &&		return I != Data.PerPartOutput.end() && Part < I->second.size() &&
I->second[Part];		I->second[Part];
}		}

bool hasAnyVectorValue(VPValue *Def) const {
return Data.PerPartOutput.contains(Def);
}

Ayal: How/Is this removal related?
fhahn: The last user of this function has been removed in the patch.
Ayal: Very well!
bool hasScalarValue(VPValue *Def, VPIteration Instance) {		bool hasScalarValue(VPValue *Def, VPIteration Instance) {
auto I = Data.PerPartScalars.find(Def);		auto I = Data.PerPartScalars.find(Def);
if (I == Data.PerPartScalars.end())		if (I == Data.PerPartScalars.end())
return false;		return false;
unsigned CacheIdx = Instance.Lane.mapToCacheIndex(VF);		unsigned CacheIdx = Instance.Lane.mapToCacheIndex(VF);
return Instance.Part < I->second.size() &&		return Instance.Part < I->second.size() &&
CacheIdx < I->second[Instance.Part].size() &&		CacheIdx < I->second[Instance.Part].size() &&
I->second[Instance.Part][CacheIdx];		I->second[Instance.Part][CacheIdx];
▲ Show 20 Lines • Show All 2,762 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/VPlanTransforms.h

Show First 20 Lines • Show All 62 Lines • ▼ Show 20 Lines struct VPlanTransforms {

/// Wrap predicated VPReplicateRecipes with a mask operand in an if-then /// Wrap predicated VPReplicateRecipes with a mask operand in an if-then

/// region block and remove the mask operand. Optimize the created regions by /// region block and remove the mask operand. Optimize the created regions by

/// iteratively sinking scalar operands into the region, followed by merging /// iteratively sinking scalar operands into the region, followed by merging

/// regions until no improvements are remaining. /// regions until no improvements are remaining.

static void createAndOptimizeReplicateRegions(VPlan &Plan); static void createAndOptimizeReplicateRegions(VPlan &Plan);

/// Replace (ICMP_ULE, wide canonical IV, backedge-taken-count) checks with an /// Replace (ICMP_ULE, wide canonical IV, backedge-taken-count) checks with an

/// (active-lane-mask recipe, wide canonical IV, trip-count). If \p /// (active-lane-mask recipe, wide canonical IV, trip-count). If \p

AyalUnsubmitted

Done

nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to InstCombine.

Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to truncateToMinimalBitwidths(), rather than pass MinBWs to optimize()?

AyalUnsubmitted

Done

Thoughts on the above?
Better truncate to minimal bitwidth asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?


AyalUnsubmitted

Done

WDYT on the above: should the caller of optimize() precede its call with a direct call to truncateToMinimalBitwidths(), rather than pass MinBWs to optimize()?

fhahnAuthorUnsubmitted

Done

Sounds good, updated, thanks!


/// UseActiveLaneMaskForControlFlow is true, introduce an /// UseActiveLaneMaskForControlFlow is true, introduce an

/// VPActiveLaneMaskPHIRecipe. If \p DataAndControlFlowWithoutRuntimeCheck is /// VPActiveLaneMaskPHIRecipe. If \p DataAndControlFlowWithoutRuntimeCheck is

/// true, no minimum-iteration runtime check will be created (during skeleton /// true, no minimum-iteration runtime check will be created (during skeleton

/// creation) and instead it is handled using active-lane-mask. \p /// creation) and instead it is handled using active-lane-mask. \p

/// DataAndControlFlowWithoutRuntimeCheck implies \p /// DataAndControlFlowWithoutRuntimeCheck implies \p

/// UseActiveLaneMaskForControlFlow. /// UseActiveLaneMaskForControlFlow.

static void addActiveLaneMask(VPlan &Plan, static void addActiveLaneMask(VPlan &Plan,

bool UseActiveLaneMaskForControlFlow, bool UseActiveLaneMaskForControlFlow,

bool DataAndControlFlowWithoutRuntimeCheck); bool DataAndControlFlowWithoutRuntimeCheck);

/// Insert truncates and extends for any truncated recipe. Redundant casts

/// will folded later.

AyalUnsubmitted

Done

/// Insert truncates and extends for any truncated recipe. Redundant casts

- /// will folded later.

+ /// will be folded later.

static void

Ayal:

fhahnAuthorUnsubmitted

Done

Fixed, thanks!

fhahn: Fixed, thanks!

static void

truncateToMinimalBitwidths(VPlan &Plan,

const MapVector<Instruction *, uint64_t> &MinBWs,

LLVMContext &Ctx);

private: private:

/// Remove redundant VPBasicBlocks by merging them into their predecessor if /// Remove redundant VPBasicBlocks by merging them into their predecessor if

/// the predecessor has a single successor. /// the predecessor has a single successor.

static bool mergeBlocksIntoPredecessors(VPlan &Plan); static bool mergeBlocksIntoPredecessors(VPlan &Plan);

/// Remove redundant casts of inductions. /// Remove redundant casts of inductions.

/// ///

/// Such redundant casts are casts of induction variables that can be ignored, /// Such redundant casts are casts of induction variables that can be ignored,

Show All 14 Lines private:

/// the needs of vector extracts. /// the needs of vector extracts.

static void optimizeInductions(VPlan &Plan, ScalarEvolution &SE); static void optimizeInductions(VPlan &Plan, ScalarEvolution &SE);

/// Remove redundant EpxandSCEVRecipes in \p Plan's entry block by replacing /// Remove redundant EpxandSCEVRecipes in \p Plan's entry block by replacing

/// them with already existing recipes expanding the same SCEV expression. /// them with already existing recipes expanding the same SCEV expression.

static void removeRedundantExpandSCEVRecipes(VPlan &Plan); static void removeRedundantExpandSCEVRecipes(VPlan &Plan);

}; };

AyalUnsubmitted

Done

Note: a VPlan-based InstCombine could take care of these "hints" by folding redundant extend-truncate pairs.

Ayal: Note: a VPlan-based InstCombine could take care of these "hints" by folding redundant extend…

fhahnAuthorUnsubmitted

Done

Agreed, I think we already have a few separate transforms that could fit into a general instcombine transform

fhahn: Agreed, I think we already have a few separate transforms that could fit into a general…

AyalUnsubmitted

Done

The dead casts removal at the end of current truncateToMinimalBitwidths() should already be taken care of by recipe dce, right?

Ayal: The dead casts removal at the end of current truncateToMinimalBitwidths() should already be…

fhahnAuthorUnsubmitted

Done

Yes that should be taken care of.

fhahn: Yes that should be taken care of.
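
For readers following this thread, the folding of redundant extend-truncate pairs can be pictured with a small standalone sketch. It does not use LLVM's VPlan API: `Node`, `makeLeaf`, `makeZExt`, `makeTrunc` and `foldTruncOfZExt` are hypothetical names invented for illustration. The first two cases mirror the Trunc handling visible in `simplifyRecipe` in this diff; the third case (source wider than the truncated type) lies in a collapsed part of the diff and is an assumption here.

```cpp
#include <cassert>
#include <memory>

// Toy expression node, NOT LLVM's VPlan recipes. All names are hypothetical.
enum class Op { Leaf, ZExt, Trunc };

struct Node {
  Op Kind;
  unsigned Bits;             // bit width of the value this node produces
  std::shared_ptr<Node> Src; // operand for ZExt/Trunc, null for Leaf
};

using NodePtr = std::shared_ptr<Node>;

NodePtr makeLeaf(unsigned Bits) {
  return std::make_shared<Node>(Node{Op::Leaf, Bits, nullptr});
}
NodePtr makeZExt(NodePtr Src, unsigned Bits) {
  return std::make_shared<Node>(Node{Op::ZExt, Bits, std::move(Src)});
}
NodePtr makeTrunc(NodePtr Src, unsigned Bits) {
  return std::make_shared<Node>(Node{Op::Trunc, Bits, std::move(Src)});
}

// Fold trunc(zext(A)) according to how A's width compares to the trunc width.
NodePtr foldTruncOfZExt(NodePtr N) {
  if (N->Kind != Op::Trunc || !N->Src || N->Src->Kind != Op::ZExt)
    return N; // not the pattern we are looking for
  NodePtr A = N->Src->Src;
  if (A->Bits == N->Bits)
    return A;                        // the pair cancels out entirely
  if (A->Bits < N->Bits)
    return makeZExt(A, N->Bits);     // a single, narrower extend suffices
  return makeTrunc(A, N->Bits);      // assumed third case: truncate A directly
}
```

With an 8-bit leaf `A`, `foldTruncOfZExt(makeTrunc(makeZExt(A, 32), 8))` hands back `A` itself, so the cast pair disappears and any now-unused casts are left for dead-code elimination.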

} // namespace llvm } // namespace llvm

#endif // LLVM_TRANSFORMS_VECTORIZE_VPLANTRANSFORMS_H #endif // LLVM_TRANSFORMS_VECTORIZE_VPLANTRANSFORMS_H

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

Show First 20 Lines • Show All 746 Lines • ▼ Show 20 Lines for (VPFirstOrderRecurrencePHIRecipe *FOR : RecurrencePhis) {

// all users. // all users.

RecurSplice->setOperand(0, FOR); RecurSplice->setOperand(0, FOR);

} }

return true; return true;

} }

void VPlanTransforms::clearReductionWrapFlags(VPlan &Plan) { void VPlanTransforms::clearReductionWrapFlags(VPlan &Plan) {

for (VPRecipeBase &R : for (VPRecipeBase &R :

Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) { Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {

AyalUnsubmitted

Done

nit: can return the type size in bits, as that is what is needed here. Op >> VPV?

Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Ayal: nit: can return the type size in bits, as that is what is needed here. Op >> VPV? Thought…

fhahnAuthorUnsubmitted

Done

Adjusted to return size in bits to simplify code, thanks!

> Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited.

AyalUnsubmitted

Done

>> Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.

> Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.

fhahnAuthorUnsubmitted

Done

This has been updated to now use VPTypeAnalysis.

fhahn: This has been updated to now use VPTypeAnalysis.

AyalUnsubmitted

Done

nit: VPValue *Op >> VPValue *VPV?

Ayal: nit: `VPValue *Op` >> `VPValue *VPV`?

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

auto *PhiR = dyn_cast<VPReductionPHIRecipe>(&R); auto *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);

if (!PhiR) if (!PhiR)

continue; continue;

const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor(); const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();

RecurKind RK = RdxDesc.getRecurrenceKind(); RecurKind RK = RdxDesc.getRecurrenceKind();

if (RK != RecurKind::Add && RK != RecurKind::Mul) if (RK != RecurKind::Add && RK != RecurKind::Mul)

continue; continue;

AyalUnsubmitted

Done

nit: worth an empty line?

Ayal: nit: worth an empty line?

fhahnAuthorUnsubmitted

Done

added, thanks!

fhahn: added, thanks!

SmallSetVector<VPValue *, 8> Worklist; SmallSetVector<VPValue *, 8> Worklist;

Worklist.insert(PhiR); Worklist.insert(PhiR);

for (unsigned I = 0; I != Worklist.size(); ++I) { for (unsigned I = 0; I != Worklist.size(); ++I) {

VPValue *Cur = Worklist[I]; VPValue *Cur = Worklist[I];

if (auto *RecWithFlags = if (auto *RecWithFlags =

dyn_cast<VPRecipeWithIRFlags>(Cur->getDefiningRecipe())) { dyn_cast<VPRecipeWithIRFlags>(Cur->getDefiningRecipe())) {

RecWithFlags->dropPoisonGeneratingFlags(); RecWithFlags->dropPoisonGeneratingFlags();

AyalUnsubmitted

Done

continue;

- auto *UI =

- cast_or_null<Instruction>(R.getVPSingleValue()->getUnderlyingValue());

+ VPValue *ResultVPV = R.getVPSingleValue();

+ auto *UI = cast_or_null<Instruction>(ResultVPV->getUnderlyingValue());

auto I = MinBWs.find(UI);

Ayal:

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

} }

AyalUnsubmitted

Done

nit: is find() ok given a null UI?

Ayal: nit: is find() ok given a null UI?

fhahnAuthorUnsubmitted

Done

Yes I think so, the keys are pointers and they shouldn't be dereferenced.

fhahn: Yes I think so, the keys are pointers and they shouldn't be dereferenced.
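
The point about null keys can be illustrated with a standalone sketch. It uses `std::unordered_map` rather than LLVM's `MapVector`, and `Instr` and `lookupMinBW` are hypothetical names; the assumption is only that, as stated above, a pointer-keyed lookup compares pointer values and never dereferences the key.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for an IR instruction; not an LLVM type.
struct Instr {
  int Opcode;
};

// Returns the recorded minimal bit width for Key, or 0 if there is no entry.
// A null Key is simply a pointer value that is not present in the map.
uint64_t lookupMinBW(const std::unordered_map<const Instr *, uint64_t> &MinBWs,
                     const Instr *Key) {
  auto It = MinBWs.find(Key); // hashes the pointer value, never reads *Key
  return It == MinBWs.end() ? 0 : It->second;
}
```

Returning 0 for a missing entry matches the shape of the final code in this diff, where `MinBWs.lookup(UI)` is followed by `if (!NewResSizeInBits) continue;`.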

for (VPUser *U : Cur->users()) { for (VPUser *U : Cur->users()) {

auto *UserRecipe = dyn_cast<VPRecipeBase>(U); auto *UserRecipe = dyn_cast<VPRecipeBase>(U);

AyalUnsubmitted

Done

continue;

+ unsigned ResSizeInBits = GetSizeInBits(ResultVPV);

unsigned NewResSizeInBits = I->second;

Ayal:

fhahnAuthorUnsubmitted

Done

Adjusted, thanks!

fhahn: Adjusted, thanks!

if (!UserRecipe) if (!UserRecipe)

continue; continue;

for (VPValue *V : UserRecipe->definedValues()) for (VPValue *V : UserRecipe->definedValues())

Worklist.insert(V); Worklist.insert(V);

AyalUnsubmitted

Done

Type *ResTy = UI->getType();

- if (!ResTy->isIntegerTy() ||

- ResTy->getScalarSizeInBits() == NewResSizeInBits)

+ if (!ResTy->isIntegerTy() || ResSizeInBits == NewResSizeInBits)

continue;

Ayal:

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

} }

AyalUnsubmitted

Done

nit: this can be checked first, instead of checking for single defined value.

Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Ayal: nit: this can be checked first, instead of checking for single defined value. Thought…

fhahnAuthorUnsubmitted

Done

Moved the check up, thanks!

> Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

Might be a potential follow-up, but we would still potentially update MinBWs on each recipe replacement?

AyalUnsubmitted

Not Done

>> Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?

> Might be a potential follow-up, but we would still potentially update MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.

AyalUnsubmitted

Not Done

Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe when the latter is created, supplementing its underlying inst.

Ayal: Just noting potential follow-up, possibly as a TODO somewhere: attach each MinBW to its recipe…

} }

AyalUnsubmitted

Done

nit: `auto ResNewTyInBits = I->second;`
nit: `auto ResNewTy = IntegerType::get(ResTy->getContext(), ResNewTyInBits);` ?

fhahnAuthorUnsubmitted

Done

Added variables, thanks!

fhahn: Added variables, thanks!

/// Returns true is \p V is constant one. /// Returns true is \p V is constant one.

static bool isConstantOne(VPValue *V) { static bool isConstantOne(VPValue *V) {

AyalUnsubmitted

Done

nit: suffice to check isa<> and continue to work with R instead of VPW?

Ayal: nit: suffice to check isa<> and continue to work with R instead of VPW?

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

if (!V->isLiveIn()) if (!V->isLiveIn())

AyalUnsubmitted

Done

UI is aka UV. Better call it UI from the start, as it's an Instruction* rather than Value*.

Ayal: UI is aka UV. Better call it UI from the start, as it's an Instruction* rather than Value*.

fhahnAuthorUnsubmitted

Done

Renamed, thanks

fhahn: Renamed, thanks

return false; return false;

auto *C = dyn_cast<ConstantInt>(V->getLiveInIRValue()); auto *C = dyn_cast<ConstantInt>(V->getLiveInIRValue());

return C && C->isOne(); return C && C->isOne();

} }

/// Returns the llvm::Instruction opcode for \p R. /// Returns the llvm::Instruction opcode for \p R.

AyalUnsubmitted

Done

UI->getType() is aka ResTy. Already early-continued if it was equal in size to I->second. Can it be smaller in size than I->second? If so worth early-continuing above, if not worth asserting?

Ayal: UI->getType() is aka ResTy. Already early-continued if it was equal in size to I->second. Can…

fhahnAuthorUnsubmitted

Done

Updated to use ResTy and replace check with assert, thanks!

fhahn: Updated to use `ResTy` and replace check with assert, thanks!

static unsigned getOpcodeForRecipe(VPRecipeBase &R) { static unsigned getOpcodeForRecipe(VPRecipeBase &R) {

AyalUnsubmitted

Done

Operand of SExt/ZExt must be smaller in size than its result, so if result is at most I->second so must its operand be?

Ayal: Operand of SExt/ZExt must be smaller in size than its result, so if result is at most I->second…

fhahnAuthorUnsubmitted

Done

Currently it must be `ResTy > NewResTy`, and the operand can also be `>= NewResTy` I think. There are also test cases exercising the path.

AyalUnsubmitted

Done

case Instruction::ZExt: {

- assert(ResTy->getScalarSizeInBits() > NewResSizeInBits &&

- "Nothing to shrink?");

+ assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");

if (GetSizeInBits(R.getOperand(0)) >= NewResSizeInBits)

Ayal:

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

AyalUnsubmitted

Done

nit: can set auto *Op = R.getOperand(0); for consistency with below.

Ayal: nit: can set `auto *Op = R.getOperand(0);` for consistency with below.

AyalUnsubmitted

Done

nit: can assert ResSizeInBits > NewResSizeInBits above, after early-continuing if they're equal.
Actually, they shouldn't even be equal (also compares?), assuming MinBWs is up-to-date and each insn is visited and optimized once. Current code also early-continues when equal, so replacing it with an assert can be done in a separate patch.

Ayal: nit: can assert ResSizeInBits > NewResSizeInBits above, after early-continuing if they're equal.

fhahnAuthorUnsubmitted

Done

I think the assertion might not always hold, e.g. for truncate recipes.

if (auto *WidenR = dyn_cast<VPWidenRecipe>(&R)) if (auto *WidenR = dyn_cast<VPWidenRecipe>(&R))

return WidenR->getUnderlyingInstr()->getOpcode(); return WidenR->getUnderlyingInstr()->getOpcode();

AyalUnsubmitted

Done

OK, operand < ResTy due to SExt/ZExt,
and NewResTy < ResTy due to MinBW.
NewResTy == ResTy cases should arguably be excluded from MinBWs? (independent of this patch)
Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?
If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by break, followed by ZExt back to ResTy.
Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.

Ayal: OK, operand < ResTy due to SExt/ZExt, and NewResTy < ResTy due to MinBW. NewResTy == ResTy…

fhahnAuthorUnsubmitted

Done

> Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?

In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.

> If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by break, followed by ZExt back to ResTy.

Yep.

> Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.

Yes, we would at the moment generate redundant extend/trunc chains, which would indeed be good to clean up. I think we could fold those as a follow-up.

AyalUnsubmitted

Done

>> Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?

> In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.

Yes, the extend-to-Res is replaced by a narrower extend-to-NewRes, but w/o another extend-back-to-Res to provide the original width, might it feed a user, say, a binary operation with mismatched size operands - where the other operand can also shrink to NewRes (as guaranteed by MinBWs) but was extended-back-to-Res? I.e., should all shrunks extend-back-to-Res, or none of them? May need better test coverage.

fhahnAuthorUnsubmitted

Done

Hm I am not sure, but if MinBWs is set to a specific bit width, wouldn't this require all users to have the same minimal bit width for the value?

AyalUnsubmitted

Done

Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
A pair of such instructions where one feeds the other will have a zext-trunc pair added between them, which will later be folded.

B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MinBWs? They could potentially be narrowed to load only the required bits, though it's unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)

fhahnAuthorUnsubmitted

Done

The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.

fhahn: The latest version doesn't have special treatment for casts, they remain unchanged and VPlan…
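
As a value-level illustration of scheme A above (truncate the operands, operate in the narrow type, zero-extend the result), here is a minimal standalone sketch. It is plain C++ arithmetic rather than VPlan recipes, the function names are hypothetical, and the choice of 32 bits narrowed to 8 bits is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstdint>

// Original computation: an add performed in 32 bits.
uint32_t addWide(uint32_t A, uint32_t B) { return A + B; }

// Narrowed computation: trunc both operands to 8 bits, add in 8 bits,
// then zext the result back to the original 32-bit type.
uint32_t addNarrowed(uint32_t A, uint32_t B) {
  uint8_t TA = static_cast<uint8_t>(A);        // trunc operand 0
  uint8_t TB = static_cast<uint8_t>(B);        // trunc operand 1
  uint8_t Sum = static_cast<uint8_t>(TA + TB); // add, wrapping modulo 2^8
  return static_cast<uint32_t>(Sum);           // zext back to 32 bits
}

// A user that only demands the low 8 bits, which is the situation in which
// a minimal bit width of 8 would be recorded for the add.
uint8_t lowByte(uint32_t V) { return static_cast<uint8_t>(V); }
```

For `A = 300` and `B = 500`, `addWide` yields 800 and `addNarrowed` yields 32; both have the same low byte, which is all such a user observes. When one narrowed instruction feeds another, the zext of the first result is immediately followed by the trunc of the second instruction's operand, which is the redundant pair left for later folding.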

if (auto *WidenC = dyn_cast<VPWidenCastRecipe>(&R)) if (auto *WidenC = dyn_cast<VPWidenCastRecipe>(&R))

AyalUnsubmitted

Done

nit: may look better to take R's opcode than UI's, but that requires casting it to VPWidenCastRecipe, so above isa maybe worth dyn_cast after all...

Ayal: nit: may look better to take R's opcode than UI's, but that requires casting it to…

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

return WidenC->getOpcode(); return WidenC->getOpcode();

if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R))

return RepR->getUnderlyingInstr()->getOpcode(); return RepR->getUnderlyingInstr()->getOpcode();

if (auto *VPI = dyn_cast<VPInstruction>(&R)) if (auto *VPI = dyn_cast<VPInstruction>(&R))

return VPI->getOpcode(); return VPI->getOpcode();

return 0; return 0;

} }

/// Try to simplify recipe \p R. /// Try to simplify recipe \p R.

static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) { static void simplifyRecipe(VPRecipeBase &R, VPTypeAnalysis &TypeInfo) {

switch (getOpcodeForRecipe(R)) { switch (getOpcodeForRecipe(R)) {

case Instruction::Mul: { case Instruction::Mul: {

VPValue *A = R.getOperand(0); VPValue *A = R.getOperand(0);

AyalUnsubmitted

Done

assert Op > NewRes? What about the condition operand of select?

Ayal: assert Op > NewRes? What about the condition operand of select?

fhahnAuthorUnsubmitted

Done

Added assert, thanks!

Hmm, select would indeed be handled incorrectly, but I wasn't able to find a suitable test case. Removed VPWidenSelect for now, but will try to come up with a test case. Alternatively could leave select-handling in + assert to surface a test case, if one exists.

fhahn: Added assert, thanks! Hmm, select would indeed be handled incorrectly, but I wasn't able to…

AyalUnsubmitted

Done

Current code seems to handle selects, and compares, as well as loads and phi's - extending only their result - although MinBWs seems to exclude them(?). So Blend and WidenMemory recipes need not be considered, neither should Replicate recipe - those are to retain their current BW (hence all should extend back to ResTy rather than shrinking all to NewResTy). Worth trying to check if all insns of MinBWs were considered somehow?

Ayal: Current code seems to handle selects, and compares, as well as loads and phi's - extending only…

fhahnAuthorUnsubmitted

Done

Updated to also handle selects and replicate recipes. New tests should have been added a while ago.

I also added an assert checking if the number of processed instructions matches MinBWs.size().

fhahn: Updated to also handle selects and replicate recipes. New tests should have been added a while…
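
The coverage check mentioned here can be sketched in isolation. This is a hedged illustration, not the code from the patch: string names stand in for recipes and their underlying instructions, `countProcessed` and `allMinBWsCovered` are hypothetical helpers, and each recipe is assumed to map to a distinct key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Counts the recipes that have a non-zero minimal bit width recorded.
// Names stand in for underlying instructions; assumed unique per recipe.
std::size_t countProcessed(const std::vector<std::string> &Recipes,
                           const std::map<std::string, uint64_t> &MinBWs) {
  std::size_t NumProcessed = 0;
  for (const std::string &R : Recipes) {
    auto It = MinBWs.find(R);
    if (It == MinBWs.end() || It->second == 0)
      continue; // nothing recorded for this recipe, skip it
    ++NumProcessed;
  }
  return NumProcessed;
}

// True if every entry in MinBWs was matched by some visited recipe.
bool allMinBWsCovered(const std::vector<std::string> &Recipes,
                      const std::map<std::string, uint64_t> &MinBWs) {
  return countProcessed(Recipes, MinBWs) == MinBWs.size();
}
```

A mismatch would mean some entry of the map belongs to a recipe kind the transform never visits, which is the situation the assert is meant to surface.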

VPValue *B = R.getOperand(1); VPValue *B = R.getOperand(1);

if (isConstantOne(A)) if (isConstantOne(A))

AyalUnsubmitted

Done

continue;

- auto *Shrunk = new VPWidenCastRecipe(

- Instruction::Trunc, Op, IntegerType::get(Ctx, NewResSizeInBits));

+ auto *Shrunk = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);

R.setOperand(Idx, Shrunk);

Ayal:

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

return R.getVPSingleValue()->replaceAllUsesWith(B); return R.getVPSingleValue()->replaceAllUsesWith(B);

AyalUnsubmitted

Done

nit: first take care of creating and inserting Shrunk, then take care of R's flags drop and operand set?

Ayal: nit: first take care of creating and inserting Shrunk, then take care of R's flags drop and…

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

if (isConstantOne(B)) if (isConstantOne(B))

return R.getVPSingleValue()->replaceAllUsesWith(A); return R.getVPSingleValue()->replaceAllUsesWith(A);

break; break;

AyalUnsubmitted

Done

R.getOperand(Idx) is aka Op.

Ayal: R.getOperand(Idx) is aka Op.

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

} }

case Instruction::Trunc: { case Instruction::Trunc: {

VPRecipeBase *Ext = R.getOperand(0)->getDefiningRecipe(); VPRecipeBase *Ext = R.getOperand(0)->getDefiningRecipe();

if (!Ext) if (!Ext)

break; break;

AyalUnsubmitted

Done

// Extend result to original width.

- auto *Ext =

- new VPWidenCastRecipe(Instruction::ZExt, R.getVPSingleValue(), ResTy);

+ auto *Ext = new VPWidenCastRecipe(Instruction::ZExt, ResultVPV, ResTy);

ResultVPV->replaceAllUsesWith(Ext);

Ayal:

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

unsigned ExtOpcode = getOpcodeForRecipe(*Ext); unsigned ExtOpcode = getOpcodeForRecipe(*Ext);

if (ExtOpcode != Instruction::ZExt && ExtOpcode != Instruction::SExt) if (ExtOpcode != Instruction::ZExt && ExtOpcode != Instruction::SExt)

AyalUnsubmitted

Done

ResultVPV->replaceAllUsesWith(Ext);

- Ext->setOperand(0, R.getVPSingleValue());

+ Ext->setOperand(0, ResultVPV);

Ext->insertAfter(&R);

Ayal:

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

break; break;

AyalUnsubmitted

Done

nit: define auto *RVPValue = R.getVPSingleValue() above?

Would be good to have a common base class for all recipes having a single value, as this amounts to a cast.

Ayal: nit: define `auto *RVPValue = R.getVPSingleValue()` above? Would be good to have a common base…

fhahnAuthorUnsubmitted

Done

> nit: define auto *RVPValue = R.getVPSingleValue() above?

Done thanks!

> Would be good to have a common base class for all recipes having a single value, as this amounts to a cast.

Yes, I think that came up in earlier patches as well.

VPValue *A = Ext->getOperand(0); VPValue *A = Ext->getOperand(0);

VPValue *Trunc = R.getVPSingleValue(); VPValue *Trunc = R.getVPSingleValue();

Type *TruncTy = TypeInfo.inferScalarType(Trunc); Type *TruncTy = TypeInfo.inferScalarType(Trunc);

AyalUnsubmitted

Done

Other insertions of shrunk operands and smaller extends are placed before R; this one is placed after - and calls for make_early_inc_range, right?

Ayal: Other insertions of shrunk operands and smaller extends are placed before R; this one is placed…

fhahnAuthorUnsubmitted

Done

Yep.

fhahn: Yep.

Type *ATy = TypeInfo.inferScalarType(A);

if (TruncTy == ATy) {

Trunc->replaceAllUsesWith(A);

} else if (ATy->getScalarSizeInBits() < TruncTy->getScalarSizeInBits()) {

auto *VPC =

new VPWidenCastRecipe(Instruction::CastOps(ExtOpcode), A, TruncTy);

VPC->insertBefore(&R);

Trunc->replaceAllUsesWith(VPC);

[29 lines hidden]

static void simplifyRecipes(VPlan &Plan, LLVMContext &Ctx) {

VPTypeAnalysis TypeInfo(Ctx);

for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {

for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {

simplifyRecipe(R, TypeInfo);

}

void VPlanTransforms::truncateToMinimalBitwidths(

VPlan &Plan, const MapVector<Instruction *, uint64_t> &MinBWs,

LLVMContext &Ctx) {

#ifndef NDEBUG

unsigned NumProcessedRecipes = 0;

#endif

// Keep track of created truncates, so they can be re-used.

DenseMap<VPValue *, VPWidenCastRecipe *> ProcessedTruncs;

VPTypeAnalysis TypeInfo(Ctx);

VPBasicBlock *PH = Plan.getEntry();

for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(

vp_depth_first_deep(Plan.getVectorLoopRegion()))) {

for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {

if (!isa<VPWidenRecipe, VPWidenCastRecipe, VPReplicateRecipe,

VPWidenSelectRecipe>(&R))

continue;

VPValue *ResultVPV = R.getVPSingleValue();

auto *UI = cast_or_null<Instruction>(ResultVPV->getUnderlyingValue());

unsigned NewResSizeInBits = MinBWs.lookup(UI);

if (!NewResSizeInBits)

continue;

#ifndef NDEBUG

NumProcessedRecipes++;

#endif

// If the value wasn't vectorized, we must maintain the original scalar

// type. Skip those here, after incrementing NumProcessedRecipes. Also

// skip casts, as redundant casts will be removed during recipe

// simplification.

if (isa<VPReplicateRecipe, VPWidenCastRecipe>(&R))

continue;

Type *OldResTy = TypeInfo.inferScalarType(ResultVPV);

unsigned OldResSizeInBits = OldResTy->getScalarSizeInBits();

assert(OldResTy->isIntegerTy() && "only integer types supported");

assert(OldResSizeInBits > NewResSizeInBits && "Nothing to shrink?");

LLVMContext &Ctx = OldResTy->getContext();

AyalUnsubmitted

Done

Should be the same Ctx passed in as parameter?

Ayal: Should be the same `Ctx` passed in as parameter?

fhahnAuthorUnsubmitted

Done

Yes, fixed!

fhahn: Yes, fixed!

auto *NewResTy = IntegerType::get(Ctx, NewResSizeInBits);

// Shrink operands by introducing truncates as needed.

unsigned StartIdx = isa<VPWidenSelectRecipe>(&R) ? 1 : 0;

for (unsigned Idx = StartIdx; Idx != R.getNumOperands(); ++Idx) {

auto *Op = R.getOperand(Idx);

unsigned OpSizeInBits =

TypeInfo.inferScalarType(Op)->getScalarSizeInBits();

if (OpSizeInBits == NewResSizeInBits)

continue;

assert(OpSizeInBits > NewResSizeInBits && "nothing to truncate");

auto ProcessedTrunc = ProcessedTruncs.insert({Op, nullptr});

if (ProcessedTrunc.second) {

auto Shrunk = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);

ProcessedTrunc.first->second = Shrunk;

if (!Op->isLiveIn()) {

Shrunk->insertBefore(&R);

} else {

#ifndef NDEBUG

auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());

bool IsContained = MinBWs.contains(OpInst);

assert((!OpInst || IsContained) &&

"All processed instructions should be contained in MinBWs.");

NumProcessedRecipes += IsContained;

#endif

PH->appendRecipe(Shrunk);

}

R.setOperand(Idx, ProcessedTrunc.first->second);

}

// Any wrapping introduced by shrinking this operation shouldn't be

// considered undefined behavior. So, we can't unconditionally copy

// arithmetic wrapping flags to VPW.

if (auto *VPW = dyn_cast<VPRecipeWithIRFlags>(&R))

VPW->dropPoisonGeneratingFlags();

// Extend result to original width.

AyalUnsubmitted

Not Done

#endif

}

- R.setOperand(Idx, ProcessedIter->second);

}

// Any wrapping introduced by shrinking this operation shouldn't be

redundant - hoist above the early-continue.

Ayal: redundant - hoist above the early-continue.

fhahnAuthorUnsubmitted

Done

Fixed in the committed version, thanks!

fhahn: Fixed in the committed version, thanks!

auto *Ext = new VPWidenCastRecipe(Instruction::ZExt, ResultVPV, OldResTy);

Ext->insertAfter(&R);

ResultVPV->replaceAllUsesWith(Ext);

Ext->setOperand(0, ResultVPV);

}

assert(MinBWs.size() == NumProcessedRecipes &&

"some entries in MinBWs haven't been processed");

}
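
A minimal sketch, using plain integers rather than VPlan recipes, of why the poison-generating flags are dropped before the result is extended back to the original width: an add that does not wrap at the original width can wrap once its operands are truncated, yet the low bits still agree. The helper names and the 32-bit to 8-bit widths are assumptions chosen for illustration and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: add at the original 32-bit width and keep only the
// low 8 bits, which are assumed to be all that the users demand.
uint32_t addWideThenMask(uint32_t A, uint32_t B) {
  return (A + B) & 0xFFu;
}

// Hypothetical helper mirroring the shrunk form: truncate both operands,
// add at 8 bits (wrapping is intentional), then zero-extend the result.
uint32_t addNarrowThenZExt(uint32_t A, uint32_t B) {
  uint8_t TruncA = static_cast<uint8_t>(A);               // trunc i32 -> i8
  uint8_t TruncB = static_cast<uint8_t>(B);               // trunc i32 -> i8
  uint8_t Narrow = static_cast<uint8_t>(TruncA + TruncB); // add i8, may wrap
  return static_cast<uint32_t>(Narrow);                   // zext i8 -> i32
}
```

For example, 200 + 100 fits in 32 bits but wraps to 44 in 8 bits. Both helpers return 44, so the narrow add is fine as long as it is allowed to wrap; a no-wrap flag copied from the wide add would turn that wrap into poison.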

void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {

removeRedundantCanonicalIVs(Plan);

AyalUnsubmitted

Done

nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

Ayal: nit: are these still hints to InstCombine, or to subsequent VPlan cleanups?

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

removeRedundantInductionCasts(Plan);

optimizeInductions(Plan, SE);

AyalUnsubmitted

Done

auto GetSizeInBits = [](VPValue *VPV) {

- auto *UV = VPV->getUnderlyingValue();

- if (UV)

+ if (auto *UV = VPV->getUnderlyingValue())

return UV->getType()->getScalarSizeInBits();

nit

Ayal: nit

fhahnAuthorUnsubmitted

Done

Code has been moved to D159202

fhahn: Code has been moved to D159202

AyalUnsubmitted

Done

removeRedundantCanonicalIVs(Plan);

removeRedundantInductionCasts(Plan);

- optimizeInductions(Plan, SE);

+ optimizeInductions(Plan, SE);

simplifyRecipes(Plan, SE.getContext());

nit: redundant move of empty line?

Ayal: nit: redundant move of empty line?

fhahnAuthorUnsubmitted

Done

changed back, thanks!

fhahn: changed back, thanks!

simplifyRecipes(Plan, SE.getContext());

removeDeadRecipes(Plan);

AyalUnsubmitted

Done

optimizeInductions(Plan, SE);

- VPTypeAnalysis TypeInfo(SE.getContext());

- if (!Plan.hasVF(ElementCount::getFixed(1)))

+ if (!Plan.hasVF(ElementCount::getFixed(1))) {

+ VPTypeAnalysis TypeInfo(SE.getContext());

truncateToMinimalBitwidths(Plan, MinBWs, TypeInfo);

+ }

simplifyRecipes(Plan, SE.getContext());

nit

Ayal: nit

fhahnAuthorUnsubmitted

Done

Done, thanks! This also limits the scope of TypeInfo to the range where it is valid. After `truncateToMinimalBitwidths`, we would otherwise need to invalidate the info for the modified recipes. This can be done in the future.

AyalUnsubmitted

Done

Very well. Worth commenting that TypeInfo should not be used following truncateToMinimalBitwidths.

Ayal: Very well. Worth commenting that `TypeInfo` should not be used following…

fhahnAuthorUnsubmitted

Done

Sunk further into truncateToMinimalBitwidths.

AyalUnsubmitted

Done

nit: ProcessedRecipesNum?

Ayal: nit: `ProcessedRecipesNum`?

fhahnAuthorUnsubmitted

Done

Changed to NumProcessedRecipes

fhahn: Changed to `NumProcessedRecipes`

AyalUnsubmitted

Done

ProcessedTruncs is used outside ifdef below, move its definition out of ifdef here? Or is it meant to ensure truncated operands are counted once by ProcessedRecipes for debugging only? If an operand is truncated multiple times, all its truncations must be to the same size, because "MinBW should specify a consistent minimal bit width for all users(, and for all operands)"?

Worth explaining why processed truncs are recorded.

Ayal: `ProcessedTruncs` is used outside ifdef below, move its definition out of ifdef here? Or is it…

fhahnAuthorUnsubmitted

Done

It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Moved out of ifdef

fhahn: It's to re-use previously generated truncates. Note that we cannot RAUW after creating the new…

AyalUnsubmitted

Done

> Note that we cannot RAUW after creating the new truncate, as this may make other uses not well typed (until they are processed and all their operands are truncated)

Very well, may deserve a comment.

fhahnAuthorUnsubmitted

Done

Added a comment to ProcessedTruncs definition.
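
The reuse described in this thread can be sketched outside LLVM as follows. This is an illustration under assumptions, not the patch's code: `Node`, `TruncCache`, and `getOrCreateTrunc` are hypothetical names, and `std::unordered_map` stands in for `DenseMap`. A single `insert` both probes the cache and reserves the slot, and the wide operand itself is left untouched so its other users stay well typed until they are processed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a VPValue or recipe; not the LLVM class.
struct Node {
  std::string Name;
};

// Hypothetical cache keyed by the wide operand, mapping to its truncate.
struct TruncCache {
  std::unordered_map<const Node *, Node *> Processed;
  std::vector<std::unique_ptr<Node>> Owned;

  Node *getOrCreateTrunc(const Node *Op) {
    // One insert call: Res.second is true only if Op had no entry yet.
    auto Res = Processed.insert({Op, nullptr});
    if (Res.second) {
      // First request for this operand: create the truncate and record it.
      Owned.push_back(
          std::unique_ptr<Node>(new Node{"trunc(" + Op->Name + ")"}));
      Res.first->second = Owned.back().get();
    }
    // Later requests reuse the recorded truncate; Op itself is not modified.
    return Res.first->second;
  }
};
```

Calling `getOrCreateTrunc` twice with the same operand returns the same node, so each user swaps in the shared truncate for its own operand only.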

fhahn: Added a comment to ProcessedTruncs definition.

createAndOptimizeReplicateRegions(Plan);

AyalUnsubmitted

Done

return UV->getType()->getScalarSizeInBits();

- if (auto *VPC = dyn_cast<VPWidenCastRecipe>(VPV)) {

+ if (auto *VPC = dyn_cast<VPWidenCastRecipe>(VPV))

return VPC->getResultType()->getScalarSizeInBits();

- }

llvm_unreachable("trying to get type of a VPValue without type info");

nit

Ayal: nit

fhahnAuthorUnsubmitted

Done

code has been moved to D159202

fhahn: code has been moved to D159202

removeRedundantExpandSCEVRecipes(Plan);

AyalUnsubmitted

Done

Should PH be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert them at the end of PH...

Ayal: Should `PH` be skipped? Trying to shrink the (live-in) operands of recipes in PH will insert…

fhahnAuthorUnsubmitted

Done

Good point. There should be nothing to shrink in PH for now, as the analysis is for the loop body only. Adjusted!

mergeBlocksIntoPredecessors(Plan);

}

AyalUnsubmitted

Done

Define ProcessedRecipes only for debug?

/// First truncate live-ins that represent relevant Instructions.

Ayal: Define `ProcessedRecipes` only for debug? /// First truncate live-ins that represent relevant…

fhahnAuthorUnsubmitted

Done

Wrapped and added comment, thanks!

fhahn: Wrapped and added comment, thanks!

AyalUnsubmitted

Done

Suffice to ask if (!NewResSizeInBits)?

Ayal: Suffice to ask `if (!NewResSizeInBits)`?

fhahnAuthorUnsubmitted

Done

This code has now been removed; LiveIns are handled when truncating the other operands of an instruction; otherwise we leave the type info in an inconsistent state.

fhahn: This code has now been removed; LiveIns are handled when truncating the other operands of an…

AyalUnsubmitted

Done

(Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.

Ayal: (Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it…

AyalUnsubmitted

Done

Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.

Ayal: Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a…

fhahnAuthorUnsubmitted

Done

LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

fhahn: LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.

AyalUnsubmitted

Done

#endif

- VPBasicBlock *PH =

- cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSinglePredecessor());

- ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(

- Plan.getEntry());

+ VPBasicBlock *PH = Plan.getEntry();

+ ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(PH);

for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {

Ayal: ?

fhahnAuthorUnsubmitted

Done

Simplified, thanks!

AyalUnsubmitted

Done

Shrunk operands are placed before R, but its extension is placed after - and calls for this make_early_inc_range, right?

Ayal: Shrunk operands are placed before R, but its extension is placed after - and calls for this…

fhahnAuthorUnsubmitted

Done

Yep

fhahn: Yep

// Add a VPActiveLaneMaskPHIRecipe and related recipes to \p Plan and replace

AyalUnsubmitted

Done

nit: use LiveInInst or something similar rather than UI?

Ayal: nit: use `LiveInInst` or something similar rather than `UI`?

fhahnAuthorUnsubmitted

Done

Renamed, thanks!

fhahn: Renamed, thanks!

// the loop terminator with a branch-on-cond recipe with the negated

AyalUnsubmitted

Done

Would `MinBWs.lookup(UI)` look better? Returning zero clearly indicates not found.

fhahnAuthorUnsubmitted

Done

Updated, thanks!
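
A small sketch of the idiom suggested here, assuming zero is never a valid minimal bit width. `lookupOrZero` and `shouldShrink` are hypothetical helpers that imitate a `lookup()` returning a value-initialized result for a missing key; `std::map` with string keys stands in for the real map of instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for lookup(): return the mapped value, or a
// value-initialized one (0 for integers) when the key is absent.
uint64_t lookupOrZero(const std::map<std::string, uint64_t> &MinBWs,
                      const std::string &Key) {
  auto It = MinBWs.find(Key);
  return It == MinBWs.end() ? 0 : It->second;
}

// A zero result doubles as "no entry", so one check replaces find + compare.
bool shouldShrink(const std::map<std::string, uint64_t> &MinBWs,
                  const std::string &Key) {
  uint64_t NewResSizeInBits = lookupOrZero(MinBWs, Key);
  return NewResSizeInBits != 0;
}
```

The design choice only holds while zero cannot be stored as a real width; otherwise an explicit `find` would be needed to tell the two cases apart.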

fhahn: Updated, thanks!

// active-lane-mask as operand. Note that this turns the loop into an

AyalUnsubmitted

Done

assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

Ayal: assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.

fhahnAuthorUnsubmitted

Done

Turned into an assert, thanks!

fhahn: Turned into an assert, thanks!

// uncountable one. Only the existing terminator is replaced, all other existing

// recipes/users remain unchanged, except for poison-generating flags being

// dropped from the canonical IV increment. Return the created

// VPActiveLaneMaskPHIRecipe.

AyalUnsubmitted

Done

unsigned NewResSizeInBits = I->second;

- Type *ResTy = VPV->getLiveInIRValue()->getType();

+ Type *ResTy = UI->getType();

if (!ResTy->isIntegerTy())


fhahnAuthorUnsubmitted

Done

Adjusted, thanks!

fhahn: Adjusted, thanks!

//

AyalUnsubmitted

Done

Can this happen - continuing will lose a member of MinBWs - better assert instead?

Ayal: Can this happen - continuing will lose a member of MinBWs - better assert instead?

fhahnAuthorUnsubmitted

Done

Turned into assert, thanks!

fhahn: Turned into assert, thanks!

// The function uses the following definitions:

//

// %TripCount = DataWithControlFlowWithoutRuntimeCheck ?

// calculate-trip-count-minus-VF (original TC) : original TC

// %IncrementValue = DataWithControlFlowWithoutRuntimeCheck ?

// CanonicalIVPhi : CanonicalIVIncrement

AyalUnsubmitted

Done

auto *Shrunk = new VPWidenCastRecipe(Instruction::Trunc, VPV, NewResTy);

- VPBasicBlock *PH = dyn_cast<VPBasicBlock>(

+ VPBasicBlock *PH = cast<VPBasicBlock>(

Plan.getVectorLoopRegion()->getSinglePredecessor());

Set once before the loop for all live-ins to be truncated.

Ayal: Set once before the loop for all live-ins to be truncated.

fhahnAuthorUnsubmitted

Done

hoisted, thanks!

fhahn: hoisted, thanks!

// %StartV is the canonical induction start value.

//

AyalUnsubmitted

Done

Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

Ayal: Just note that the counting of ProcessedRecipes may miss casts that fail to be processed later.

fhahnAuthorUnsubmitted

Done

Do you mean updating the comment here or just a general note? We need to include the recipes in the count, otherwise the verification later will fail

fhahn: Do you mean updating the comment here or just a general note? We need to include the recipes in…

AyalUnsubmitted

Done

I mean we count casts as if they are processed, expecting they will be later, w/o checking that they actually do.

Ayal: I mean we count casts as if they are processed, expecting they will be later, w/o checking that…

fhahnAuthorUnsubmitted

Done

They don't need handling explicitly, as redundant casts will be removed later. Expanded the comment slightly to

Also skip casts which do not need to be handled explicitly here, as redundant casts will be removed during recipe simplification.

fhahn: They don't need handling explicitly, as redundant casts will be removed later. Expanded the…

// The function adds the following recipes:

//

// vector.ph:

AyalUnsubmitted

Done

Can skip phi's, none are included in MinBWs.

Ayal: Can skip phi's, none are included in MinBWs.

fhahnAuthorUnsubmitted

Done

There's an early continue now that skips phis and other unsupported recipes.

fhahn: There's an early continue now that skips phis and other unsupported recipes.

// %TripCount = calculate-trip-count-minus-VF (original TC)

AyalUnsubmitted

Done

Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

Ayal: Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.

fhahnAuthorUnsubmitted

Done

Nope, looks like this is not needed in the latest version.

fhahn: Nope, looks like this is not needed in the latest version.

// [if DataWithControlFlowWithoutRuntimeCheck]

AyalUnsubmitted

Done

Is `OldResSizeInBits` equal to the size of `OldResTy`, for the non-cast Widen or Select `R`?

fhahnAuthorUnsubmitted

Done

Yes, I forgot to remove this use of IR getType. Updated to use TypeInfo.inferScalarType(ResultVPV) and then getScalarSizeInBits of the returned type.

fhahn: Yes, I forgot to remove this use of IR `getType`. Updated to use ` TypeInfo.inferScalarType…

AyalUnsubmitted

Done

Ah, ok, wondered if using the size of the type of UI directly would be simpler?

Ayal: Ah, ok, wondered if using the size of the type of `UI` directly would be simpler?

fhahnAuthorUnsubmitted

Done

It might be slightly simpler, but it could lead to a crash further down the line once we support recipes without underlying values/instructions (and forget to update this line) and/or if some other transform adjusted the type. Left as is for now.

// %EntryInc = canonical-iv-increment-for-part %StartV

// %EntryALM = active-lane-mask %EntryInc, %TripCount

AyalUnsubmitted

Done

Any order other than depth first would also do, right?

Ayal: Any order other than depth first would also do, right?

fhahnAuthorUnsubmitted

Done

Yes, I think the order doesn't matter here.

fhahn: Yes, I think the order doesn't matter here.

AyalUnsubmitted

Done

But a (more) expensive RPOT order is needed, to handle defs before uses?

Ayal: But a (more) expensive RPOT order is needed, to handle defs before uses?

fhahnAuthorUnsubmitted

Done

The latest version should not need RPO, as the bit width of the results does not change for any user (previously it might, due to early cast simplifications). Changed to depth first.

//

// vector.body:

// ...

// %P = active-lane-mask-phi [ %EntryALM, %vector.ph ], [ %ALM, %vector.body ]

// ...

// %InLoopInc = canonical-iv-increment-for-part %IncrementValue

// %ALM = active-lane-mask %InLoopInc, TripCount

// %Negated = Not %ALM

// branch-on-cond %Negated

//

AyalUnsubmitted

Done

Suffice to ask if (!NewResSizeInBits)?

Ayal: Suffice to ask `if (!NewResSizeInBits)`?

fhahnAuthorUnsubmitted

Done

Simplified, thanks!

fhahn: Simplified, thanks!

static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(

AyalUnsubmitted

Done

(Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
Or migrate the analysis that builds MinBWs to run on VPlan.

Ayal: (Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each…

AyalUnsubmitted

Done

Thoughts about the above?

Ayal: Thoughts about the above?

fhahnAuthorUnsubmitted

Done

I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.

fhahn: I think it would be best to have the analysis based on VPlan. Building MinBWs early would…

VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {

AyalUnsubmitted

Done

nit: lookup.

Ayal: nit: lookup.

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();

VPBasicBlock *EB = TopRegion->getExitingBasicBlock();

auto *CanonicalIVPHI = Plan.getCanonicalIV();

AyalUnsubmitted

Done

Ins? Perhaps ProcessedTrunc?

Ayal: `Ins`? Perhaps `ProcessedTrunc`?

fhahnAuthorUnsubmitted

Done

Updated, thanks!

fhahn: Updated, thanks!

VPValue *StartV = CanonicalIVPHI->getStartValue();

AyalUnsubmitted

Done

Handle the simple if !ins.second /* Op already processed */ case first, potentially early-continuing?

Clearer to check if ProcessedTruncs.lookup(Op) or if ProcessedTruncs.contains(Op) and if so use ProcessedTruncs[Op], otherwise insert it?

Ayal: Handle the simple if !ins.second /* Op already processed */ case first, potentially early…

fhahnAuthorUnsubmitted

Done

Early continue would mean duplicating the code to update the operands, so I left things as is for now, including using insert. insert means we only need to look up the insert position once, vs. 2 lookups with a separate lookup and then `[]`. WDYT?

AyalUnsubmitted

Not Done

OK, WDYT of something as follows:

        auto [ProcessedIter, DidNotExist] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = DidNotExist ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!DidNotExist)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }

AyalUnsubmitted

Not Done

Maybe IterIsEmpty would be a better name, to avoid double negation, as in:

        auto [ProcessedIter, IterIsEmpty] = ProcessedTruncs.insert({Op, nullptr});
        VPWidenCastRecipe *NewOp = IterIsEmpty ? new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy)
                                               : ProcessedIter->second;
        R.setOperand(Idx, NewOp);
        if (!IterIsEmpty)
          continue;
        ProcessedIter->second = NewOp;
        if (!Op->isLiveIn()) {
          NewOp->insertBefore(&R);
        } else {
          PH->appendRecipe(NewOp);
#ifndef NDEBUG
          auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());
          bool IsContained = MinBWs.contains(OpInst);
          assert((!OpInst || IsContained) &&
                 "All processed instructions should be contained in MinBWs.");
          NumProcessedRecipes += IsContained;
#endif
        }


AyalUnsubmitted

Done

Would be good to comment how memory and replicate cases are (not) processed.

Ayal: Would be good to comment how memory and replicate cases are (not) processed.

fhahnAuthorUnsubmitted

Done

Added a comment, thanks!

fhahn: Added a comment, thanks!

auto *CanonicalIVIncrement =

cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());

AyalUnsubmitted

Done

Should replicate recipes be handled next to handling widen memory recipes above?

Ayal: Should replicate recipes be handled next to handling widen memory recipes above?

fhahnAuthorUnsubmitted

Done

We still need to count them for verification
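
A hedged sketch of the counting scheme discussed in this thread: every recipe with an entry in the map is counted first, and only afterwards does the pass decide whether to rewrite it, so the final size check still balances when replicate recipes and casts are skipped. `Recipe`, `countProcessed`, and the string keys are hypothetical stand-ins, not the patch's types.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical recipe model: a name plus a flag saying whether the pass
// rewrites it (widened) or leaves it alone (replicated or a cast).
struct Recipe {
  std::string Name;
  bool Rewritten;
};

// Counts every recipe that has a MinBWs entry, then decides whether to
// shrink it. Returns the number of counted entries.
std::size_t countProcessed(const std::vector<Recipe> &Recipes,
                           const std::map<std::string, unsigned> &MinBWs,
                           std::vector<std::string> &Shrunk) {
  std::size_t NumProcessed = 0;
  for (const Recipe &R : Recipes) {
    if (MinBWs.find(R.Name) == MinBWs.end())
      continue; // No entry: nothing to count, nothing to shrink.
    ++NumProcessed; // Count before any skip so the final check balances.
    if (!R.Rewritten)
      continue; // Counted, but left at its original width.
    Shrunk.push_back(R.Name);
  }
  return NumProcessed;
}
```

A caller can then compare the returned count against `MinBWs.size()`, in the spirit of the assertion at the end of `truncateToMinimalBitwidths`.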

fhahn: We still need to count them for verification

AyalUnsubmitted

Done

nit: place simpler if !isLiveIn case first?

Ayal: nit: place simpler if !isLiveIn case first?

fhahnAuthorUnsubmitted

Done

Done, thanks!

fhahn: Done, thanks!

// TODO: Check if dropping the flags is needed if

// !DataAndControlFlowWithoutRuntimeCheck.

CanonicalIVIncrement->dropPoisonGeneratingFlags();

DebugLoc DL = CanonicalIVIncrement->getDebugLoc();

AyalUnsubmitted

Done

Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs members correspond to Integer types, of distinct (smaller) size, whether live-in or not?

Ayal: Better assert than continue? Here ProcessedRecipes was already bumped, but should all MinBWs…

fhahnAuthorUnsubmitted

Done

Turned isIntegerTy into an assert but retained the size check, as there are entries where the sizes are the same (e.g. for truncs).

AyalUnsubmitted

Done

nit: ResTy >> OldResTy, ResSizeInBits >> OldResSizeInBits

Ayal: nit: `ResTy` >> `OldResTy`, `ResSizeInBits` >> `OldResSizeInBits`

fhahnAuthorUnsubmitted

Done

Renamed, thanks!

fhahn: Renamed, thanks!

// We can't use StartV directly in the ActiveLaneMask VPInstruction, since

AyalUnsubmitted

Done

Is it possible for MinBWs not to contain Op's live-in IR value in this case?

Ayal: Is it possible for MinBWs not to contain Op's live-in IR value in this case?

fhahnAuthorUnsubmitted

Done

Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying assert.

fhahn: Yes, MinBWs only contains instructions, but not other values like arguments. Added a clarifying…

AyalUnsubmitted

Done

#ifndef NDEBUG

- bool IsContained =

- MinBWs.contains(dyn_cast<Instruction>(Op->getLiveInIRValue()));

+ auto *OpInst = dyn_cast<Instruction>(Op->getLiveInIRValue());

+ bool IsContained = MinBWs.contains(OpInst);

+ assert((!OpInst || IsContained) && "...");

ProcessedRecipes += IsContained;

- assert((IsContained || !isa<Instruction>(Op->getLiveInIRValue())) &&

"All processed instructions should be contained in MinBWs.");

nit

Ayal: nit

// we have to take unrolling into account. Each part needs to start at

// Part * VF

AyalUnsubmitted

Done

assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?"); here instead of below?

Ayal: `assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");` here instead of below?

fhahnAuthorUnsubmitted

Done

Done, and also removed continue

fhahn: Done, and also removed continue

auto *VecPreheader = cast<VPBasicBlock>(TopRegion->getSinglePredecessor());

VPBuilder Builder(VecPreheader);

AyalUnsubmitted

Done

Note that truncations of live-ins could also be inserted before R, thereby leaving the treatment of live-ins to debugging only, and leaving their LICM and commoning to a subsequent VPlan cleanup pass, along with trunc-zext foldings.

Ayal: Note that truncations of live-ins could also be inserted before R, thereby leaving the…

fhahnAuthorUnsubmitted

Done

Yep, for now it is simpler and results in a smaller test diff to do it directly there, as it is not only LICM but also very simple CSE.

// Create the ActiveLaneMask instruction using the correct start values.

VPValue *TC = Plan.getTripCount();

AyalUnsubmitted

Done

nit: VPC >> OldExt, Opc >> OldOpc?

Ayal: nit: `VPC` >> `OldExt`, `Opc` >> `OldOpc`?

fhahnAuthorUnsubmitted

Done

This code is now gone, handled by recipe simplification.

fhahn: This code is now gone, handled by recipe simplification.

VPValue *TripCount, *IncrementValue;

AyalUnsubmitted

Done

This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?

OTOH, better handle Trunc here as well? Is it handled well below?

Ayal: This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other? OTOH…

fhahnAuthorUnsubmitted

Done

Thanks, changed to if. I don't think Trunc is handled explicitly in the latest version.

fhahn: Thanks, changed to `if`. I don't think Trunc is handled explicitly in the latest version.

AyalUnsubmitted

Not Done

Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and have its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?

if (!DataAndControlFlowWithoutRuntimeCheck) {

// When the loop is guarded by a runtime overflow check for the loop

// induction variable increment by VF, we can increment the value before

AyalUnsubmitted

Done

Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.

Ayal: Comment is obsolete here - dealt with new type being equal to operand type, which should result…

fhahnAuthorUnsubmitted

Done

Code is gone now

fhahn: Code is gone now

// the get.active.lane mask and use the unmodified tripcount.

AyalUnsubmitted

Done

// SExt/Zext is redundant - stick with its operand.
?

Ayal: `// SExt/Zext is redundant - stick with its operand.` ?

fhahnAuthorUnsubmitted

Done

this check has been moved up and is not needed any longer.

fhahn: this check has been moved up and is not needed any longer.

AyalUnsubmitted

Done

// SExt/Zext is redundant - stick with its operand.

- Instruction::CastOps Opcode = VPC->getOpcode();

+ Instruction::CastOps NewOpc = Opc;

VPValue *Op = R.getOperand(0);

Ayal: ?

fhahnAuthorUnsubmitted

Done

Code now gone.

fhahn: Code now gone.

  IncrementValue = CanonicalIVIncrement;
  TripCount = TC;
} else {
  // When avoiding a runtime check, the active.lane.mask inside the loop

Ayal (Done): nit: `C` >> `NewCast`?

If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?

fhahn (Author, Done): Code gone now.

  // uses a modified trip count and the induction variable increment is
  // done after the active.lane.mask intrinsic is called.
  IncrementValue = CanonicalIVPHI;

Ayal (Done): Place assert earlier?

fhahn (Author, Done): moved up, thanks!

  TripCount = Builder.createNaryOp(VPInstruction::CalculateTripCountMinusVF,
                                   {TC}, DL);
}

Ayal (Done):

auto *C = new VPWidenCastRecipe(Opcode, Op, NewResTy);
- C->insertBefore(&R);
- ResultVPV->replaceAllUsesWith(C);
+ C->insertBefore(&VPC);
+ VPC->replaceAllUsesWith(C);
continue;

fhahn (Author, Done): adjusted, thanks!

auto *EntryIncrement = Builder.createOverflowingOp(
    VPInstruction::CanonicalIVIncrementForPart, {StartV}, {false, false}, DL,
    "index.part.next");

// Create the active lane mask instruction in the VPlan preheader.
auto *EntryALM =
    Builder.createNaryOp(VPInstruction::ActiveLaneMask, {EntryIncrement, TC},
                         DL, "active.lane.mask.entry");

// Now create the ActiveLaneMaskPhi recipe in the main loop using the
// preheader ActiveLaneMask instruction.

Ayal (Done): This means the size of all operands is equal to NewResSizeInBits, can this be?

fhahn (Author, Done): There are cases where a Zext narrowed earlier is used as operand here, so the type is already adjusted.

Ayal (Not Done): Maybe worth a comment.
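
The two questions raised above (whether to reuse `Op` when it already has the target width, and whether an operand can already be `NewResSizeInBits` wide) can be illustrated with a toy model. The sketch below is not LLVM code: `ToyValue`, `ToyNarrower`, `narrow` and `numTruncs` are hypothetical names, and the per-value cache is only an assumption about one way to avoid creating the same truncate twice. It also assumes every operand is at least as wide as the target width.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a VPlan value: a name plus a scalar bit width.
struct ToyValue {
  std::string Name;
  unsigned Bits;
};

// Hypothetical helper (not an LLVM API): hands out operands narrowed to
// NewBits, creating at most one truncate per source value.
class ToyNarrower {
  std::vector<std::unique_ptr<ToyValue>> Owned;      // keeps truncates alive
  std::map<const ToyValue *, ToyValue *> TruncCache; // source -> its truncate
  unsigned NewBits;

public:
  explicit ToyNarrower(unsigned NewBits) : NewBits(NewBits) {}

  ToyValue *narrow(ToyValue *Op) {
    // Operand already has the target width (for example an extend that was
    // narrowed earlier): use it directly, no redundant cast.
    if (Op->Bits == NewBits)
      return Op;
    // Reuse a truncate created earlier for the same operand.
    auto It = TruncCache.find(Op);
    if (It != TruncCache.end())
      return It->second;
    Owned.push_back(
        std::make_unique<ToyValue>(ToyValue{"trunc." + Op->Name, NewBits}));
    TruncCache[Op] = Owned.back().get();
    return Owned.back().get();
  }

  unsigned numTruncs() const { return static_cast<unsigned>(Owned.size()); }
};
```

With a 16-bit target, narrowing a value that is already 16 bits wide returns the value itself, and narrowing the same 32-bit value twice returns the same truncate both times.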

auto LaneMaskPhi = new VPActiveLaneMaskPHIRecipe(EntryALM, DebugLoc());
LaneMaskPhi->insertAfter(CanonicalIVPHI);

// Create the active lane mask for the next iteration of the loop before the
// original terminator.

Ayal (Done):

auto *Shrunk = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);
- R.setOperand(Idx, Shrunk);
Shrunk->insertBefore(&R);
+ R.setOperand(Idx, Shrunk);
}
if (auto *VPW = dyn_cast<VPRecipeWithIRFlags>(&R))

nit: keep consistent with above.

fhahn (Author, Done): Adjusted, thanks!

VPRecipeBase *OriginalTerminator = EB->getTerminator();
Builder.setInsertPoint(OriginalTerminator);
auto *InLoopIncrement =
    Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
                                {IncrementValue}, {false, false}, DL);
auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
                                 {InLoopIncrement, TripCount}, DL,
                                 "active.lane.mask.next");
LaneMaskPhi->addOperand(ALM);

Ayal (Done):

auto *Ext = new VPWidenCastRecipe(Instruction::ZExt, ResultVPV, ResTy);
- ResultVPV->replaceAllUsesWith(Ext);
- Ext->setOperand(0, ResultVPV);
Ext->insertAfter(&R);
+ Ext->setOperand(0, ResultVPV);
+ ResultVPV->replaceAllUsesWith(Ext);
}

nit: keep consistent with above.

fhahn (Author, Done): reordered, thanks!

// Replace the original terminator with BranchOnCond. We have to invert the
// mask here because a true condition means jumping to the exit block.
auto *NotMask = Builder.createNot(ALM, DL);
Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
OriginalTerminator->eraseFromParent();
return LaneMaskPhi;
}


llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll

; CHECK-NEXT: [[TMP3:%.*]] = zext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>		; CHECK-NEXT: [[TMP3:%.*]] = zext <16 x i8> [[WIDE_LOAD2]] to <16 x i16>
; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>		; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
; CHECK-NEXT: [[TMP5:%.*]] = mul nuw <16 x i16> [[TMP3]], [[TMP4]]		; CHECK-NEXT: [[TMP5:%.*]] = mul nuw <16 x i16> [[TMP3]], [[TMP4]]
; CHECK-NEXT: [[TMP6:%.*]] = lshr <16 x i16> [[TMP5]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP6:%.*]] = lshr <16 x i16> [[TMP5]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP7:%.*]] = trunc <16 x i16> [[TMP6]] to <16 x i8>		; CHECK-NEXT: [[TMP7:%.*]] = trunc <16 x i16> [[TMP6]] to <16 x i8>
; CHECK-NEXT: store <16 x i8> [[TMP7]], ptr [[TMP2]], align 1		; CHECK-NEXT: store <16 x i8> [[TMP7]], ptr [[TMP2]], align 1
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]		; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1		; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1
; CHECK-NEXT: [[TMP9:%.*]] = zext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>		; CHECK-NEXT: [[TMP9:%.*]] = zext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
; CHECK-NEXT: [[TMP10:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>		; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <16 x i16> [[TMP9]], [[TMP4]]
Ayal (Not Done): hmm, we now spot the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, originally both TMP4 and TMP10.
; CHECK-NEXT: [[TMP11:%.*]] = mul nuw <16 x i16> [[TMP9]], [[TMP10]]		; CHECK-NEXT: [[TMP11:%.*]] = lshr <16 x i16> [[TMP10]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP12:%.*]] = lshr <16 x i16> [[TMP11]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP12:%.*]] = trunc <16 x i16> [[TMP11]] to <16 x i8>
; CHECK-NEXT: [[TMP13:%.*]] = trunc <16 x i16> [[TMP12]] to <16 x i8>		; CHECK-NEXT: store <16 x i8> [[TMP12]], ptr [[TMP8]], align 1
; CHECK-NEXT: store <16 x i8> [[TMP13]], ptr [[TMP8]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; CHECK-NEXT: [[TMP14:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP14]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_VEC]], [[TMP0]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]
; CHECK: vec.epilog.iter.check:		; CHECK: vec.epilog.iter.check:
; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8		; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 8
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0		; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK_NOT_NOT:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]		; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK_NOT_NOT]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:		; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]		; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288		; CHECK-NEXT: [[N_VEC5:%.*]] = and i64 [[TMP0]], 4294967288
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:		; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX7:%.]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX7:%.]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX7]]		; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX7]]
; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1		; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i8>, ptr [[TMP14]], align 1
; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX7]]		; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX7]]
; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP16]], align 1		; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>		; CHECK-NEXT: [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD9]] to <8 x i16>
; CHECK-NEXT: [[TMP18:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>		; CHECK-NEXT: [[TMP17:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>
; CHECK-NEXT: [[TMP19:%.*]] = mul nuw <8 x i16> [[TMP17]], [[TMP18]]		; CHECK-NEXT: [[TMP18:%.*]] = mul nuw <8 x i16> [[TMP16]], [[TMP17]]
Ayal (Not Done): Spotted and removed duplicate zext of WIDE_LOAD8.
; CHECK-NEXT: [[TMP20:%.*]] = lshr <8 x i16> [[TMP19]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP19:%.*]] = lshr <8 x i16> [[TMP18]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP21:%.*]] = trunc <8 x i16> [[TMP20]] to <8 x i8>		; CHECK-NEXT: [[TMP20:%.*]] = trunc <8 x i16> [[TMP19]] to <8 x i8>
; CHECK-NEXT: store <8 x i8> [[TMP21]], ptr [[TMP16]], align 1		; CHECK-NEXT: store <8 x i8> [[TMP20]], ptr [[TMP15]], align 1
; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX7]]		; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX7]]
; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP22]], align 1		; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
; CHECK-NEXT: [[TMP23:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>		; CHECK-NEXT: [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>
; CHECK-NEXT: [[TMP24:%.*]] = zext <8 x i8> [[WIDE_LOAD8]] to <8 x i16>		; CHECK-NEXT: [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP17]]
; CHECK-NEXT: [[TMP25:%.*]] = mul nuw <8 x i16> [[TMP23]], [[TMP24]]		; CHECK-NEXT: [[TMP24:%.*]] = lshr <8 x i16> [[TMP23]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP26:%.*]] = lshr <8 x i16> [[TMP25]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP25:%.*]] = trunc <8 x i16> [[TMP24]] to <8 x i8>
; CHECK-NEXT: [[TMP27:%.*]] = trunc <8 x i16> [[TMP26]] to <8 x i8>		; CHECK-NEXT: store <8 x i8> [[TMP25]], ptr [[TMP21]], align 1
; CHECK-NEXT: store <8 x i8> [[TMP27]], ptr [[TMP22]], align 1
; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8		; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8
; CHECK-NEXT: [[TMP28:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]		; CHECK-NEXT: [[TMP26:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]
; CHECK-NEXT: br i1 [[TMP28]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP26]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: vec.epilog.middle.block:		; CHECK: vec.epilog.middle.block:
; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[N_VEC5]], [[TMP0]]		; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[N_VEC5]], [[TMP0]]
; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]		; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
; CHECK: vec.epilog.scalar.ph:		; CHECK: vec.epilog.scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.cond.cleanup.loopexit:		; CHECK: for.cond.cleanup.loopexit:
; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]		; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[INDVARS_IV:%.]] = phi i64 [ [[INDVARS_IV_NEXT:%.]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]		; CHECK-NEXT: [[INDVARS_IV:%.]] = phi i64 [ [[INDVARS_IV_NEXT:%.]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDVARS_IV]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP29:%.*]] = load i8, ptr [[ARRAYIDX]], align 1		; CHECK-NEXT: [[TMP27:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP29]] to i32		; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP27]] to i32
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDVARS_IV]]		; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP30:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1		; CHECK-NEXT: [[TMP28:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
; CHECK-NEXT: [[CONV3:%.*]] = zext i8 [[TMP30]] to i32		; CHECK-NEXT: [[CONV3:%.*]] = zext i8 [[TMP28]] to i32
; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV3]], [[CONV]]		; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV3]], [[CONV]]
; CHECK-NEXT: [[SHR_26:%.*]] = lshr i32 [[MUL]], 8		; CHECK-NEXT: [[SHR_26:%.*]] = lshr i32 [[MUL]], 8
; CHECK-NEXT: [[CONV4:%.*]] = trunc i32 [[SHR_26]] to i8		; CHECK-NEXT: [[CONV4:%.*]] = trunc i32 [[SHR_26]] to i8
; CHECK-NEXT: store i8 [[CONV4]], ptr [[ARRAYIDX2]], align 1		; CHECK-NEXT: store i8 [[CONV4]], ptr [[ARRAYIDX2]], align 1
; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDVARS_IV]]		; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP31:%.*]] = load i8, ptr [[ARRAYIDX8]], align 1		; CHECK-NEXT: [[TMP29:%.*]] = load i8, ptr [[ARRAYIDX8]], align 1
; CHECK-NEXT: [[CONV9:%.*]] = zext i8 [[TMP31]] to i32		; CHECK-NEXT: [[CONV9:%.*]] = zext i8 [[TMP29]] to i32
; CHECK-NEXT: [[MUL10:%.*]] = mul nuw nsw i32 [[CONV9]], [[CONV]]		; CHECK-NEXT: [[MUL10:%.*]] = mul nuw nsw i32 [[CONV9]], [[CONV]]
; CHECK-NEXT: [[SHR11_27:%.*]] = lshr i32 [[MUL10]], 8		; CHECK-NEXT: [[SHR11_27:%.*]] = lshr i32 [[MUL10]], 8
; CHECK-NEXT: [[CONV12:%.*]] = trunc i32 [[SHR11_27]] to i8		; CHECK-NEXT: [[CONV12:%.*]] = trunc i32 [[SHR11_27]] to i8
; CHECK-NEXT: store i8 [[CONV12]], ptr [[ARRAYIDX8]], align 1		; CHECK-NEXT: store i8 [[CONV12]], ptr [[ARRAYIDX8]], align 1
; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1		; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32		; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
Show All 36 Lines	for.body: ; preds = %for.body.preheader, %for.body
br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body		br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
}		}


define void @test_shrink_zext_in_preheader(ptr noalias %src, ptr noalias %dst, i32 %A, i16 %B) {		define void @test_shrink_zext_in_preheader(ptr noalias %src, ptr noalias %dst, i32 %A, i16 %B) {
; CHECK-LABEL: define void @test_shrink_zext_in_preheader		; CHECK-LABEL: define void @test_shrink_zext_in_preheader
; CHECK-SAME: (ptr noalias [[SRC:%.]], ptr noalias [[DST:%.]], i32 [[A:%.]], i16 [[B:%.]]) {		; CHECK-SAME: (ptr noalias [[SRC:%.]], ptr noalias [[DST:%.]], i32 [[A:%.]], i16 [[B:%.]]) {
; CHECK-NEXT: iter.check:		; CHECK-NEXT: iter.check:
		; CHECK-NEXT: [[CONV10:%.*]] = zext i16 [[B]] to i32
Ayal (Done): This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b), checking that computing a 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.
fhahn (Author, Done): still more work to do :) Arguably the invariant instructions are artificial; in the regular pipeline, no invariant instructions should remain.
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
; CHECK: vector.main.loop.iter.check:		; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
; CHECK-NEXT: [[TMP0:%.*]] = insertelement <16 x i16> undef, i16 [[B]], i64 0		; CHECK-NEXT: [[TMP0:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <16 x i16> [[TMP0]], <16 x i16> poison, <16 x i32> zeroinitializer		; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
Ayal (Done): BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?
fhahn (Author, Done): The latest version avoids truncating the same value twice.
Ayal (Not Done): Duplicated TMP0 and TMP1 still here?
fhahn (Author, Done): They were due to redundant casts being added for Live-in values, fixed by checking in VPWidenCastRecipe::execute for now, with a FIXME to address this with explicit unrolling.
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <16 x i32> poison, i32 [[CONV10]], i64 0
Ayal (Done): Both insertelement's now use poison.
fhahn (Author, Done): I think the use of undef is a leftover that wasn't updated; it should be poison.
		; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT1]], <16 x i32> poison, <16 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP2:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
		; CHECK-NEXT: [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>		; CHECK-NEXT: [[TMP4:%.*]] = mul <16 x i16> [[TMP0]], [[TMP2]]
; CHECK-NEXT: [[TMP2:%.*]] = mul <16 x i16> [[BROADCAST_SPLAT2]], [[TMP1]]		; CHECK-NEXT: [[TMP5:%.*]] = mul <16 x i16> [[TMP1]], [[TMP3]]
Ayal (Done): BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?
fhahn (Author, Done): The latest version avoids truncating the same value twice.
Ayal (Not Done): Still seeing duplicate TMP2 and TMP3?
; CHECK-NEXT: [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
; CHECK-NEXT: [[TMP4:%.*]] = mul <16 x i16> [[BROADCAST_SPLAT2]], [[TMP3]]
; CHECK-NEXT: [[TMP5:%.*]] = lshr <16 x i16> [[TMP2]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP6:%.*]] = lshr <16 x i16> [[TMP4]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP6:%.*]] = lshr <16 x i16> [[TMP4]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP7:%.*]] = trunc <16 x i16> [[TMP5]] to <16 x i8>		; CHECK-NEXT: [[TMP7:%.*]] = lshr <16 x i16> [[TMP5]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP8:%.*]] = trunc <16 x i16> [[TMP6]] to <16 x i8>		; CHECK-NEXT: [[TMP8:%.*]] = trunc <16 x i16> [[TMP6]] to <16 x i8>
; CHECK-NEXT: [[TMP9:%.*]] = sext i32 [[INDEX]] to i64		; CHECK-NEXT: [[TMP9:%.*]] = trunc <16 x i16> [[TMP7]] to <16 x i8>
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP9]]		; CHECK-NEXT: [[TMP10:%.*]] = sext i32 [[INDEX]] to i64
; CHECK-NEXT: store <16 x i8> [[TMP7]], ptr [[TMP10]], align 1		; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP10]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i64 16
; CHECK-NEXT: store <16 x i8> [[TMP8]], ptr [[TMP11]], align 1		; CHECK-NEXT: store <16 x i8> [[TMP8]], ptr [[TMP11]], align 1
		; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i64 16
		; CHECK-NEXT: store <16 x i8> [[TMP9]], ptr [[TMP12]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 32		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 32
; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i32 [[INDEX_NEXT]], 992		; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], 992
; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br i1 false, label [[EXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]		; CHECK-NEXT: br i1 false, label [[EXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]
; CHECK: vec.epilog.iter.check:		; CHECK: vec.epilog.iter.check:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:		; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[TMP13:%.*]] = insertelement <8 x i16> undef, i16 [[B]], i64 0
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX4:%.]] = phi i32 [ 992, [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT9:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP14:%.*]] = trunc i32 [[A]] to i16		; CHECK-NEXT: [[TMP14:%.*]] = trunc i32 [[A]] to i16
; CHECK-NEXT: [[TMP15:%.*]] = insertelement <8 x i16> undef, i16 [[TMP14]], i64 0		; CHECK-NEXT: [[TMP15:%.*]] = insertelement <8 x i16> undef, i16 [[TMP14]], i64 0
; CHECK-NEXT: [[TMP16:%.*]] = mul <8 x i16> [[TMP15]], [[TMP13]]		; CHECK-NEXT: [[TMP16:%.*]] = insertelement <8 x i16> undef, i16 [[B]], i64 0
Ayal (Not Done): Trunc & insertelement LICM'd from vec.epilog.vector.body to vec.epilog.ph.
; CHECK-NEXT: [[TMP17:%.*]] = lshr <8 x i16> [[TMP16]], <i16 8, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>		; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK-NEXT: [[TMP18:%.*]] = trunc <8 x i16> [[TMP17]] to <8 x i8>		; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <8 x i8> [[TMP18]], <8 x i8> poison, <8 x i32> zeroinitializer		; CHECK-NEXT: [[INDEX7:%.]] = phi i32 [ 992, [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT8:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP20:%.*]] = sext i32 [[INDEX4]] to i64		; CHECK-NEXT: [[TMP17:%.*]] = mul <8 x i16> [[TMP15]], [[TMP16]]
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP20]]		; CHECK-NEXT: [[TMP18:%.*]] = lshr <8 x i16> [[TMP17]], <i16 8, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16 0>
; CHECK-NEXT: store <8 x i8> [[TMP19]], ptr [[TMP21]], align 1		; CHECK-NEXT: [[TMP19:%.*]] = trunc <8 x i16> [[TMP18]] to <8 x i8>
; CHECK-NEXT: [[INDEX_NEXT9]] = add nuw i32 [[INDEX4]], 8		; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <8 x i8> [[TMP19]], <8 x i8> poison, <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP22:%.*]] = icmp eq i32 [[INDEX_NEXT9]], 1000		; CHECK-NEXT: [[TMP21:%.*]] = sext i32 [[INDEX7]] to i64
; CHECK-NEXT: br i1 [[TMP22]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]		; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP21]]
		; CHECK-NEXT: store <8 x i8> [[TMP20]], ptr [[TMP22]], align 1
		; CHECK-NEXT: [[INDEX_NEXT8]] = add nuw i32 [[INDEX7]], 8
		; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i32 [[INDEX_NEXT8]], 1000
		; CHECK-NEXT: br i1 [[TMP23]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: vec.epilog.middle.block:		; CHECK: vec.epilog.middle.block:
; CHECK-NEXT: br i1 true, label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]		; CHECK-NEXT: br i1 true, label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]
; CHECK: vec.epilog.scalar.ph:		; CHECK: vec.epilog.scalar.ph:
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: br i1 poison, label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]		; CHECK-NEXT: br i1 poison, label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
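
The arithmetic behind `test_shrink_zext_in_preheader`, as described in the review comment on that test, can be checked in isolation: bits 8..15 of a 32-bit product depend only on the low 16 bits of each operand, so a 16-bit multiply suffices. The following is a scalar sketch of that reasoning with hypothetical helper names, not the vectorizer's code.

```cpp
#include <cstdint>

// Reference: multiply in 32 bits, then keep bits 8..15 of the product.
inline uint8_t secondByteWide(uint32_t a, uint16_t b) {
  uint32_t product = a * static_cast<uint32_t>(b);
  return static_cast<uint8_t>(product >> 8);
}

// Narrowed: truncate the 32-bit operand to 16 bits, multiply modulo 2^16,
// then shift. The multiply is done on unsigned 32-bit values and truncated
// afterwards to avoid signed overflow from integer promotion.
inline uint8_t secondByteNarrow(uint32_t a, uint16_t b) {
  uint16_t a16 = static_cast<uint16_t>(a);
  uint16_t product =
      static_cast<uint16_t>(static_cast<uint32_t>(a16) * static_cast<uint32_t>(b));
  return static_cast<uint8_t>(product >> 8);
}
```

Both functions return the same byte for every input, which is the property the narrowed `mul <16 x i16>` in the CHECK lines relies on.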
define void @test_shrink_select(ptr noalias %src, ptr noalias %dst, i32 %A, i1 %c) {		define void @test_shrink_select(ptr noalias %src, ptr noalias %dst, i32 %A, i1 %c) {
; CHECK-LABEL: define void @test_shrink_select		; CHECK-LABEL: define void @test_shrink_select
; CHECK-SAME: (ptr noalias [[SRC:%.]], ptr noalias [[DST:%.]], i32 [[A:%.]], i1 [[C:%.]]) {		; CHECK-SAME: (ptr noalias [[SRC:%.]], ptr noalias [[DST:%.]], i32 [[A:%.]], i1 [[C:%.]]) {
; CHECK-NEXT: iter.check:		; CHECK-NEXT: iter.check:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
; CHECK: vector.main.loop.iter.check:		; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
; CHECK: vector.ph:		; CHECK: vector.ph:
		; CHECK-NEXT: [[TMP0:%.*]] = trunc i32 [[A]] to i16
		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <16 x i16> undef, i16 [[TMP0]], i64 0
Ayal (Not Done): ditto.
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = trunc i32 [[A]] to i16
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <16 x i16> undef, i16 [[TMP0]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = mul <16 x i16> [[TMP1]], <i16 99, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison>		; CHECK-NEXT: [[TMP2:%.*]] = mul <16 x i16> [[TMP1]], <i16 99, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison>
; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <16 x i16> [[TMP2]], <16 x i16> poison, <16 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <16 x i16> [[TMP2]], <16 x i16> poison, <16 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = lshr <16 x i16> [[TMP3]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP4:%.*]] = lshr <16 x i16> [[TMP3]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[C]], <16 x i16> [[TMP4]], <16 x i16> [[TMP3]]		; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[C]], <16 x i16> [[TMP4]], <16 x i16> [[TMP3]]
; CHECK-NEXT: [[TMP6:%.*]] = trunc <16 x i16> [[TMP5]] to <16 x i8>		; CHECK-NEXT: [[TMP6:%.*]] = trunc <16 x i16> [[TMP5]] to <16 x i8>
; CHECK-NEXT: [[TMP7:%.*]] = sext i32 [[INDEX]] to i64		; CHECK-NEXT: [[TMP7:%.*]] = sext i32 [[INDEX]] to i64
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP7]]		; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP7]]
; CHECK-NEXT: store <16 x i8> [[TMP6]], ptr [[TMP8]], align 1		; CHECK-NEXT: store <16 x i8> [[TMP6]], ptr [[TMP8]], align 1
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 16		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 992		; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 992
; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br i1 false, label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]		; CHECK-NEXT: br i1 false, label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:		; CHECK: vec.epilog.iter.check:
; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:		; CHECK: vec.epilog.ph:
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX2:%.*]] = phi i32 [ 992, [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT5:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP10:%.*]] = trunc i32 [[A]] to i16		; CHECK-NEXT: [[TMP10:%.*]] = trunc i32 [[A]] to i16
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <8 x i16> undef, i16 [[TMP10]], i64 0		; CHECK-NEXT: [[TMP11:%.*]] = insertelement <8 x i16> undef, i16 [[TMP10]], i64 0
		; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
		; CHECK: vec.epilog.vector.body:
		; CHECK-NEXT: [[INDEX3:%.*]] = phi i32 [ 992, [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP12:%.*]] = mul <8 x i16> [[TMP11]], <i16 99, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison>		; CHECK-NEXT: [[TMP12:%.*]] = mul <8 x i16> [[TMP11]], <i16 99, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison, i16 poison>
; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <8 x i16> [[TMP12]], <8 x i16> poison, <8 x i32> zeroinitializer		; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <8 x i16> [[TMP12]], <8 x i16> poison, <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP14:%.*]] = lshr <8 x i16> [[TMP13]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		; CHECK-NEXT: [[TMP14:%.*]] = lshr <8 x i16> [[TMP13]], <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
; CHECK-NEXT: [[TMP15:%.*]] = select i1 [[C]], <8 x i16> [[TMP14]], <8 x i16> [[TMP13]]		; CHECK-NEXT: [[TMP15:%.*]] = select i1 [[C]], <8 x i16> [[TMP14]], <8 x i16> [[TMP13]]
; CHECK-NEXT: [[TMP16:%.*]] = trunc <8 x i16> [[TMP15]] to <8 x i8>		; CHECK-NEXT: [[TMP16:%.*]] = trunc <8 x i16> [[TMP15]] to <8 x i8>
; CHECK-NEXT: [[TMP17:%.*]] = sext i32 [[INDEX2]] to i64		; CHECK-NEXT: [[TMP17:%.*]] = sext i32 [[INDEX3]] to i64
; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP17]]		; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[TMP17]]
; CHECK-NEXT: store <8 x i8> [[TMP16]], ptr [[TMP18]], align 1		; CHECK-NEXT: store <8 x i8> [[TMP16]], ptr [[TMP18]], align 1
; CHECK-NEXT: [[INDEX_NEXT5]] = add nuw i32 [[INDEX2]], 8		; CHECK-NEXT: [[INDEX_NEXT4]] = add nuw i32 [[INDEX3]], 8
; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT5]], 1000		; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i32 [[INDEX_NEXT4]], 1000
; CHECK-NEXT: br i1 [[TMP19]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP19]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
; CHECK: vec.epilog.middle.block:		; CHECK: vec.epilog.middle.block:
; CHECK-NEXT: br i1 true, label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]		; CHECK-NEXT: br i1 true, label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]
; CHECK: vec.epilog.scalar.ph:		; CHECK: vec.epilog.scalar.ph:
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: br i1 poison, label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP10:![0-9]+]]		; CHECK-NEXT: br i1 poison, label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP10:![0-9]+]]
; CHECK: exit:		; CHECK: exit:

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll

	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
	; CHECK-NEXT: [[TMP4:%.*]] = add <16 x i8> [[WIDE_LOAD]], <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>			; CHECK-NEXT: [[TMP4:%.*]] = add <16 x i8> [[WIDE_LOAD]], <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>
	; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i8> [[TMP4]] to <16 x i32>			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP1]]
				Ayal: Fold zext-trunc pair; several such cases follow.
	; CHECK-NEXT: [[TMP6:%.*]] = trunc <16 x i32> [[TMP5]] to <16 x i8>			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[TMP5]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP1]]			; CHECK-NEXT: store <16 x i8> [[TMP4]], ptr [[TMP6]], align 1
	; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
	; CHECK-NEXT: store <16 x i8> [[TMP6]], ptr [[TMP8]], align 1
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
	; CHECK: vec.epilog.iter.check:			; CHECK: vec.epilog.iter.check:
	; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8			; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
	; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]			; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
	; CHECK: vec.epilog.ph:			; CHECK: vec.epilog.ph:
	; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]			; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
	; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP0]], 8			; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP0]], 8
	; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF2]]			; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF2]]
	; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
	; CHECK: vec.epilog.vector.body:			; CHECK: vec.epilog.vector.body:
	; CHECK-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT7:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT7:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP10:%.*]] = add i64 [[INDEX5]], 0			; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX5]], 0
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP10]]			; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP8]]
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP12]], align 1			; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP10]], align 1
	; CHECK-NEXT: [[TMP13:%.*]] = add <8 x i8> [[WIDE_LOAD6]], <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>			; CHECK-NEXT: [[TMP11:%.*]] = add <8 x i8> [[WIDE_LOAD6]], <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2>
	; CHECK-NEXT: [[TMP14:%.*]] = zext <8 x i8> [[TMP13]] to <8 x i32>			; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP8]]
	; CHECK-NEXT: [[TMP15:%.*]] = trunc <8 x i32> [[TMP14]] to <8 x i8>			; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[TMP12]], i32 0
	; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP10]]			; CHECK-NEXT: store <8 x i8> [[TMP11]], ptr [[TMP13]], align 1
	; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP16]], i32 0
	; CHECK-NEXT: store <8 x i8> [[TMP15]], ptr [[TMP17]], align 1
	; CHECK-NEXT: [[INDEX_NEXT7]] = add nuw i64 [[INDEX5]], 8			; CHECK-NEXT: [[INDEX_NEXT7]] = add nuw i64 [[INDEX5]], 8
	; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT7]], [[N_VEC3]]			; CHECK-NEXT: [[TMP14:%.*]] = icmp eq i64 [[INDEX_NEXT7]], [[N_VEC3]]
	; CHECK-NEXT: br i1 [[TMP18]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP14]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
	; CHECK: vec.epilog.middle.block:			; CHECK: vec.epilog.middle.block:
	; CHECK-NEXT: [[CMP_N4:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC3]]			; CHECK-NEXT: [[CMP_N4:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC3]]
	; CHECK-NEXT: br i1 [[CMP_N4]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N4]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
	; CHECK: vec.epilog.scalar.ph:			; CHECK: vec.epilog.scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC3]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC3]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup.loopexit:			; CHECK: for.cond.cleanup.loopexit:
	; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]			; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP19:%.*]] = load i8, ptr [[ARRAYIDX]], align 1			; CHECK-NEXT: [[TMP15:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
	; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP19]] to i32			; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP15]] to i32
	; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV]], 2			; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV]], 2
	; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i8			; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i8
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: store i8 [[CONV1]], ptr [[ARRAYIDX3]], align 1			; CHECK-NEXT: store i8 [[CONV1]], ptr [[ARRAYIDX3]], align 1
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32			; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, ptr [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, ptr [[TMP2]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i16>, ptr [[TMP3]], align 2			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i16>, ptr [[TMP3]], align 2
	; CHECK-NEXT: [[TMP4:%.*]] = add <8 x i16> [[WIDE_LOAD]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>			; CHECK-NEXT: [[TMP4:%.*]] = add <8 x i16> [[WIDE_LOAD]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
	; CHECK-NEXT: [[TMP5:%.*]] = zext <8 x i16> [[TMP4]] to <8 x i32>			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP6:%.*]] = trunc <8 x i32> [[TMP5]] to <8 x i16>			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[TMP5]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP1]]			; CHECK-NEXT: store <8 x i16> [[TMP4]], ptr [[TMP6]], align 2
	; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, ptr [[TMP7]], i32 0
	; CHECK-NEXT: store <8 x i16> [[TMP6]], ptr [[TMP8]], align 2
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
	; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup.loopexit:			; CHECK: for.cond.cleanup.loopexit:
	; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]			; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP10:%.*]] = load i16, ptr [[ARRAYIDX]], align 2			; CHECK-NEXT: [[TMP8:%.*]] = load i16, ptr [[ARRAYIDX]], align 2
	; CHECK-NEXT: [[CONV8:%.*]] = zext i16 [[TMP10]] to i32			; CHECK-NEXT: [[CONV8:%.*]] = zext i16 [[TMP8]] to i32
	; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV8]], 2			; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV8]], 2
	; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i16			; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i16
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: store i16 [[CONV1]], ptr [[ARRAYIDX3]], align 2			; CHECK-NEXT: store i16 [[CONV1]], ptr [[ARRAYIDX3]], align 2
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32			; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
				Ayal: We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>, but fail to fold a similar one following the add-2's?
				fhahn: Folding now all happens in simplifyRecipes, which should handle this now.
				Ayal: The one following the add-2's is also folded now.
	; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>			; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
	; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i16> [[TMP4]] to <16 x i32>			; CHECK-NEXT: [[TMP5:%.*]] = add <16 x i16> [[TMP4]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
	; CHECK-NEXT: [[TMP6:%.*]] = trunc <16 x i32> [[TMP5]] to <16 x i16>			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP7:%.*]] = add <16 x i16> [[TMP6]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>			; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[TMP6]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = zext <16 x i16> [[TMP7]] to <16 x i32>			; CHECK-NEXT: store <16 x i16> [[TMP5]], ptr [[TMP7]], align 2
	; CHECK-NEXT: [[TMP9:%.*]] = trunc <16 x i32> [[TMP8]] to <16 x i16>
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i16, ptr [[TMP10]], i32 0
	; CHECK-NEXT: store <16 x i16> [[TMP9]], ptr [[TMP11]], align 2
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
	; CHECK: vec.epilog.iter.check:			; CHECK: vec.epilog.iter.check:
	; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8			; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
	; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]			; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
	; CHECK: vec.epilog.ph:			; CHECK: vec.epilog.ph:
	; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]			; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
	; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP0]], 8			; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP0]], 8
	; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF2]]			; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF2]]
	; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
	; CHECK: vec.epilog.vector.body:			; CHECK: vec.epilog.vector.body:
	; CHECK-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT7:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX5:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT7:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP13:%.*]] = add i64 [[INDEX5]], 0			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX5]], 0
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP13]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[TMP14]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP15]], align 1			; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i8>, ptr [[TMP11]], align 1
	; CHECK-NEXT: [[TMP16:%.*]] = zext <8 x i8> [[WIDE_LOAD6]] to <8 x i16>			; CHECK-NEXT: [[TMP12:%.*]] = zext <8 x i8> [[WIDE_LOAD6]] to <8 x i16>
	; CHECK-NEXT: [[TMP17:%.*]] = zext <8 x i16> [[TMP16]] to <8 x i32>			; CHECK-NEXT: [[TMP13:%.*]] = add <8 x i16> [[TMP12]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
				Ayal: We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it. Is this a missed MinBW/trunc-zext opportunity?
				Ayal: The other pair is also folded now.
	; CHECK-NEXT: [[TMP18:%.*]] = trunc <8 x i32> [[TMP17]] to <8 x i16>			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP19:%.*]] = add <8 x i16> [[TMP18]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>			; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i16, ptr [[TMP14]], i32 0
	; CHECK-NEXT: [[TMP20:%.*]] = zext <8 x i16> [[TMP19]] to <8 x i32>			; CHECK-NEXT: store <8 x i16> [[TMP13]], ptr [[TMP15]], align 2
	; CHECK-NEXT: [[TMP21:%.*]] = trunc <8 x i32> [[TMP20]] to <8 x i16>
	; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[TMP13]]
	; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i16, ptr [[TMP22]], i32 0
	; CHECK-NEXT: store <8 x i16> [[TMP21]], ptr [[TMP23]], align 2
	; CHECK-NEXT: [[INDEX_NEXT7]] = add nuw i64 [[INDEX5]], 8			; CHECK-NEXT: [[INDEX_NEXT7]] = add nuw i64 [[INDEX5]], 8
	; CHECK-NEXT: [[TMP24:%.*]] = icmp eq i64 [[INDEX_NEXT7]], [[N_VEC3]]			; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT7]], [[N_VEC3]]
	; CHECK-NEXT: br i1 [[TMP24]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP16]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
	; CHECK: vec.epilog.middle.block:			; CHECK: vec.epilog.middle.block:
	; CHECK-NEXT: [[CMP_N4:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC3]]			; CHECK-NEXT: [[CMP_N4:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC3]]
	; CHECK-NEXT: br i1 [[CMP_N4]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N4]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
	; CHECK: vec.epilog.scalar.ph:			; CHECK: vec.epilog.scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC3]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC3]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup.loopexit:			; CHECK: for.cond.cleanup.loopexit:
	; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]			; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP25:%.*]] = load i8, ptr [[ARRAYIDX]], align 1			; CHECK-NEXT: [[TMP17:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
	; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP25]] to i32			; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP17]] to i32
	; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV]], 2			; CHECK-NEXT: [[ADD:%.*]] = add nuw nsw i32 [[CONV]], 2
	; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i16			; CHECK-NEXT: [[CONV1:%.*]] = trunc i32 [[ADD]] to i16
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i16, ptr [[Q]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: store i16 [[CONV1]], ptr [[ARRAYIDX3]], align 2			; CHECK-NEXT: store i16 [[CONV1]], ptr [[ARRAYIDX3]], align 2
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32			; CHECK-NEXT: [[LFTR_WIDEIV:%.*]] = trunc i64 [[INDVARS_IV_NEXT]] to i32
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[LFTR_WIDEIV]], [[LEN]]
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
	[106 lines hidden]
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
	; CHECK: vector.main.loop.iter.check:			; CHECK: vector.main.loop.iter.check:
	; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP0]], 16			; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP0]], 16
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0
	; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT]] to <16 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer			; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i8>
				Ayal: Hmm, before we narrowed these two shufflevectors to operate on <16 x i8> and zext-trunc their result; now we let them operate on the original <16 x i32> and truncate the result?
				fhahn (author): I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.
				Ayal: Worth testing with a subsequent InstCombine, to ensure pessimization is avoided?
	; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT]] to <16 x i32>
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <16 x i32> poison, i32 [[CONV11]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <16 x i32> poison, i32 [[CONV11]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT2]] to <16 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT2]], <16 x i32> poison, <16 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer			; CHECK-NEXT: [[TMP2:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT3]] to <16 x i8>
	; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT3]] to <16 x i32>
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP5]]			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = load <16 x i8>, ptr [[TMP7]], align 1			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP5]], align 1
	; CHECK-NEXT: [[TMP9:%.*]] = shl <16 x i8> [[TMP8]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>			; CHECK-NEXT: [[TMP6:%.*]] = shl <16 x i8> [[WIDE_LOAD]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>
	; CHECK-NEXT: [[TMP10:%.*]] = zext <16 x i8> [[TMP9]] to <16 x i32>			; CHECK-NEXT: [[TMP7:%.*]] = add <16 x i8> [[TMP6]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
				Ayal: Many zext-trunc pairs left to collect.
				fhahn (author): Should be better cleaned up now.
				Ayal: Indeed looks like it!
	; CHECK-NEXT: [[TMP11:%.*]] = trunc <16 x i32> [[TMP10]] to <16 x i8>			; CHECK-NEXT: [[TMP8:%.*]] = or <16 x i8> [[WIDE_LOAD]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>
	; CHECK-NEXT: [[TMP12:%.*]] = add <16 x i8> [[TMP11]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>			; CHECK-NEXT: [[TMP9:%.*]] = mul <16 x i8> [[TMP8]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>
	; CHECK-NEXT: [[TMP13:%.*]] = zext <16 x i8> [[TMP12]] to <16 x i32>			; CHECK-NEXT: [[TMP10:%.*]] = and <16 x i8> [[TMP7]], [[TMP1]]
	; CHECK-NEXT: [[TMP14:%.*]] = or <16 x i8> [[TMP8]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>			; CHECK-NEXT: [[TMP11:%.*]] = and <16 x i8> [[TMP9]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP15:%.*]] = zext <16 x i8> [[TMP14]] to <16 x i32>			; CHECK-NEXT: [[TMP12:%.*]] = xor <16 x i8> [[TMP11]], [[TMP2]]
	; CHECK-NEXT: [[TMP16:%.*]] = trunc <16 x i32> [[TMP15]] to <16 x i8>			; CHECK-NEXT: [[TMP13:%.*]] = mul <16 x i8> [[TMP12]], [[TMP10]]
	; CHECK-NEXT: [[TMP17:%.*]] = mul <16 x i8> [[TMP16]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>			; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP18:%.*]] = zext <16 x i8> [[TMP17]] to <16 x i32>			; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[TMP14]], i32 0
	; CHECK-NEXT: [[TMP19:%.*]] = trunc <16 x i32> [[TMP13]] to <16 x i8>			; CHECK-NEXT: store <16 x i8> [[TMP13]], ptr [[TMP15]], align 1
	; CHECK-NEXT: [[TMP20:%.*]] = trunc <16 x i32> [[TMP2]] to <16 x i8>
	Ayal: Above trunc of TMP2 is redundant along with its zext in the ph.
	; CHECK-NEXT: [[TMP21:%.*]] = and <16 x i8> [[TMP19]], [[TMP20]]
	; CHECK-NEXT: [[TMP22:%.*]] = zext <16 x i8> [[TMP21]] to <16 x i32>
	; CHECK-NEXT: [[TMP23:%.*]] = trunc <16 x i32> [[TMP18]] to <16 x i8>
	; CHECK-NEXT: [[TMP24:%.*]] = and <16 x i8> [[TMP23]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP25:%.*]] = zext <16 x i8> [[TMP24]] to <16 x i32>
	; CHECK-NEXT: [[TMP26:%.*]] = trunc <16 x i32> [[TMP25]] to <16 x i8>
	; CHECK-NEXT: [[TMP27:%.*]] = trunc <16 x i32> [[TMP4]] to <16 x i8>
	Ayal: Above trunc of TMP4 is redundant along with its zext in the ph.
	; CHECK-NEXT: [[TMP28:%.*]] = xor <16 x i8> [[TMP26]], [[TMP27]]
	; CHECK-NEXT: [[TMP29:%.*]] = zext <16 x i8> [[TMP28]] to <16 x i32>
	; CHECK-NEXT: [[TMP30:%.*]] = trunc <16 x i32> [[TMP29]] to <16 x i8>
	; CHECK-NEXT: [[TMP31:%.*]] = trunc <16 x i32> [[TMP22]] to <16 x i8>
	; CHECK-NEXT: [[TMP32:%.*]] = mul <16 x i8> [[TMP30]], [[TMP31]]
	; CHECK-NEXT: [[TMP33:%.*]] = zext <16 x i8> [[TMP32]] to <16 x i32>
	; CHECK-NEXT: [[TMP34:%.*]] = trunc <16 x i32> [[TMP33]] to <16 x i8>
	; CHECK-NEXT: [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP5]]
	; CHECK-NEXT: [[TMP36:%.*]] = getelementptr inbounds i8, ptr [[TMP35]], i32 0
	; CHECK-NEXT: store <16 x i8> [[TMP34]], ptr [[TMP36]], align 1
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP37:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP37]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP15:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP15:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
	; CHECK: vec.epilog.iter.check:			; CHECK: vec.epilog.iter.check:
	; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8			; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
	; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]			; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
	; CHECK: vec.epilog.ph:			; CHECK: vec.epilog.ph:
	; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]			; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
	; CHECK-NEXT: [[N_MOD_VF4:%.*]] = urem i64 [[TMP0]], 8			; CHECK-NEXT: [[N_MOD_VF4:%.*]] = urem i64 [[TMP0]], 8
	; CHECK-NEXT: [[N_VEC5:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF4]]			; CHECK-NEXT: [[N_VEC5:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF4]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT8:%.*]] = insertelement <8 x i32> poison, i32 [[CONV13]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <8 x i32> poison, i32 [[CONV13]], i64 0
	; CHECK-NEXT: [[TMP38:%.*]] = trunc <8 x i32> [[BROADCAST_SPLATINSERT8]] to <8 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <8 x i32> [[BROADCAST_SPLATINSERT7]], <8 x i32> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT9:%.*]] = shufflevector <8 x i8> [[TMP38]], <8 x i8> poison, <8 x i32> zeroinitializer			; CHECK-NEXT: [[TMP17:%.*]] = trunc <8 x i32> [[BROADCAST_SPLAT8]] to <8 x i8>
	; CHECK-NEXT: [[TMP39:%.*]] = zext <8 x i8> [[BROADCAST_SPLAT9]] to <8 x i32>			; CHECK-NEXT: [[BROADCAST_SPLATINSERT9:%.*]] = insertelement <8 x i32> poison, i32 [[CONV11]], i64 0
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <8 x i32> poison, i32 [[CONV11]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLAT10:%.*]] = shufflevector <8 x i32> [[BROADCAST_SPLATINSERT9]], <8 x i32> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP40:%.*]] = trunc <8 x i32> [[BROADCAST_SPLATINSERT10]] to <8 x i8>			; CHECK-NEXT: [[TMP18:%.*]] = trunc <8 x i32> [[BROADCAST_SPLAT10]] to <8 x i8>
	; CHECK-NEXT: [[BROADCAST_SPLAT11:%.*]] = shufflevector <8 x i8> [[TMP40]], <8 x i8> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP41:%.*]] = zext <8 x i8> [[BROADCAST_SPLAT11]] to <8 x i32>
	; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
	; CHECK: vec.epilog.vector.body:			; CHECK: vec.epilog.vector.body:
	; CHECK-NEXT: [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT12:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX11:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT13:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP42:%.*]] = add i64 [[INDEX7]], 0			; CHECK-NEXT: [[TMP19:%.*]] = add i64 [[INDEX11]], 0
	; CHECK-NEXT: [[TMP43:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP42]]			; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[TMP19]]
	; CHECK-NEXT: [[TMP44:%.*]] = getelementptr inbounds i8, ptr [[TMP43]], i32 0			; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[TMP20]], i32 0
	; CHECK-NEXT: [[TMP45:%.*]] = load <8 x i8>, ptr [[TMP44]], align 1			; CHECK-NEXT: [[WIDE_LOAD12:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
	; CHECK-NEXT: [[TMP46:%.*]] = shl <8 x i8> [[TMP45]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>			; CHECK-NEXT: [[TMP22:%.*]] = shl <8 x i8> [[WIDE_LOAD12]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>
	; CHECK-NEXT: [[TMP47:%.*]] = zext <8 x i8> [[TMP46]] to <8 x i32>			; CHECK-NEXT: [[TMP23:%.*]] = add <8 x i8> [[TMP22]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
	; CHECK-NEXT: [[TMP48:%.*]] = trunc <8 x i32> [[TMP47]] to <8 x i8>			; CHECK-NEXT: [[TMP24:%.*]] = or <8 x i8> [[WIDE_LOAD12]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>
	; CHECK-NEXT: [[TMP49:%.*]] = add <8 x i8> [[TMP48]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>			; CHECK-NEXT: [[TMP25:%.*]] = mul <8 x i8> [[TMP24]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>
	; CHECK-NEXT: [[TMP50:%.*]] = zext <8 x i8> [[TMP49]] to <8 x i32>			; CHECK-NEXT: [[TMP26:%.*]] = and <8 x i8> [[TMP23]], [[TMP17]]
	; CHECK-NEXT: [[TMP51:%.*]] = or <8 x i8> [[TMP45]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>			; CHECK-NEXT: [[TMP27:%.*]] = and <8 x i8> [[TMP25]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP52:%.*]] = zext <8 x i8> [[TMP51]] to <8 x i32>			; CHECK-NEXT: [[TMP28:%.*]] = xor <8 x i8> [[TMP27]], [[TMP18]]
	; CHECK-NEXT: [[TMP53:%.*]] = trunc <8 x i32> [[TMP52]] to <8 x i8>			; CHECK-NEXT: [[TMP29:%.*]] = mul <8 x i8> [[TMP28]], [[TMP26]]
	; CHECK-NEXT: [[TMP54:%.*]] = mul <8 x i8> [[TMP53]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>			; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP19]]
	; CHECK-NEXT: [[TMP55:%.*]] = zext <8 x i8> [[TMP54]] to <8 x i32>			; CHECK-NEXT: [[TMP31:%.*]] = getelementptr inbounds i8, ptr [[TMP30]], i32 0
	; CHECK-NEXT: [[TMP56:%.*]] = trunc <8 x i32> [[TMP50]] to <8 x i8>			; CHECK-NEXT: store <8 x i8> [[TMP29]], ptr [[TMP31]], align 1
	; CHECK-NEXT: [[TMP57:%.*]] = trunc <8 x i32> [[TMP39]] to <8 x i8>			; CHECK-NEXT: [[INDEX_NEXT13]] = add nuw i64 [[INDEX11]], 8
	; CHECK-NEXT: [[TMP58:%.*]] = and <8 x i8> [[TMP56]], [[TMP57]]			; CHECK-NEXT: [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT13]], [[N_VEC5]]
	; CHECK-NEXT: [[TMP59:%.*]] = zext <8 x i8> [[TMP58]] to <8 x i32>			; CHECK-NEXT: br i1 [[TMP32]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
	; CHECK-NEXT: [[TMP60:%.*]] = trunc <8 x i32> [[TMP55]] to <8 x i8>
	; CHECK-NEXT: [[TMP61:%.*]] = and <8 x i8> [[TMP60]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP62:%.*]] = zext <8 x i8> [[TMP61]] to <8 x i32>
	; CHECK-NEXT: [[TMP63:%.*]] = trunc <8 x i32> [[TMP62]] to <8 x i8>
	; CHECK-NEXT: [[TMP64:%.*]] = trunc <8 x i32> [[TMP41]] to <8 x i8>
	; CHECK-NEXT: [[TMP65:%.*]] = xor <8 x i8> [[TMP63]], [[TMP64]]
	; CHECK-NEXT: [[TMP66:%.*]] = zext <8 x i8> [[TMP65]] to <8 x i32>
	; CHECK-NEXT: [[TMP67:%.*]] = trunc <8 x i32> [[TMP66]] to <8 x i8>
	; CHECK-NEXT: [[TMP68:%.*]] = trunc <8 x i32> [[TMP59]] to <8 x i8>
	; CHECK-NEXT: [[TMP69:%.*]] = mul <8 x i8> [[TMP67]], [[TMP68]]
	; CHECK-NEXT: [[TMP70:%.*]] = zext <8 x i8> [[TMP69]] to <8 x i32>
	; CHECK-NEXT: [[TMP71:%.*]] = trunc <8 x i32> [[TMP70]] to <8 x i8>
	; CHECK-NEXT: [[TMP72:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP42]]
	; CHECK-NEXT: [[TMP73:%.*]] = getelementptr inbounds i8, ptr [[TMP72]], i32 0
	; CHECK-NEXT: store <8 x i8> [[TMP71]], ptr [[TMP73]], align 1
	; CHECK-NEXT: [[INDEX_NEXT12]] = add nuw i64 [[INDEX7]], 8
	; CHECK-NEXT: [[TMP74:%.*]] = icmp eq i64 [[INDEX_NEXT12]], [[N_VEC5]]
	; CHECK-NEXT: br i1 [[TMP74]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
	; CHECK: vec.epilog.middle.block:			; CHECK: vec.epilog.middle.block:
	; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC5]]			; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC5]]
	; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
	; CHECK: vec.epilog.scalar.ph:			; CHECK: vec.epilog.scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup.loopexit:			; CHECK: for.cond.cleanup.loopexit:
	; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]			; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[P]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP75:%.*]] = load i8, ptr [[ARRAYIDX]], align 1			; CHECK-NEXT: [[TMP33:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
	; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP75]] to i32			; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP33]] to i32
	; CHECK-NEXT: [[ADD:%.*]] = shl i32 [[CONV]], 4			; CHECK-NEXT: [[ADD:%.*]] = shl i32 [[CONV]], 4
	; CHECK-NEXT: [[CONV2:%.*]] = add nuw nsw i32 [[ADD]], 32			; CHECK-NEXT: [[CONV2:%.*]] = add nuw nsw i32 [[ADD]], 32
	; CHECK-NEXT: [[OR:%.*]] = or i32 [[CONV]], 51			; CHECK-NEXT: [[OR:%.*]] = or i32 [[CONV]], 51
	; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[OR]], 60			; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[OR]], 60
	; CHECK-NEXT: [[AND:%.*]] = and i32 [[CONV2]], [[CONV13]]			; CHECK-NEXT: [[AND:%.*]] = and i32 [[CONV2]], [[CONV13]]
	; CHECK-NEXT: [[MUL_MASKED:%.*]] = and i32 [[MUL]], 252			; CHECK-NEXT: [[MUL_MASKED:%.*]] = and i32 [[MUL]], 252
	; CHECK-NEXT: [[CONV17:%.*]] = xor i32 [[MUL_MASKED]], [[CONV11]]			; CHECK-NEXT: [[CONV17:%.*]] = xor i32 [[MUL_MASKED]], [[CONV11]]
	; CHECK-NEXT: [[MUL18:%.*]] = mul nuw nsw i32 [[CONV17]], [[AND]]			; CHECK-NEXT: [[MUL18:%.*]] = mul nuw nsw i32 [[CONV17]], [[AND]]
	[53 lines hidden]
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
	; CHECK: vector.main.loop.iter.check:			; CHECK: vector.main.loop.iter.check:
	; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP0]], 16			; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP0]], 16
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 16
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0
	; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT]] to <16 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer			; CHECK-NEXT: [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i8>
	; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT]] to <16 x i32>
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <16 x i32> poison, i32 [[CONV11]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <16 x i32> poison, i32 [[CONV11]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT2]] to <16 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT2]], <16 x i32> poison, <16 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer			; CHECK-NEXT: [[TMP2:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT3]] to <16 x i8>
	; CHECK-NEXT: [[TMP4:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT3]] to <16 x i32>
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP5]]			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[TMP6]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[TMP4]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i16>, ptr [[TMP7]], align 2			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i16>, ptr [[TMP5]], align 2
	; CHECK-NEXT: [[TMP8:%.*]] = trunc <16 x i16> [[WIDE_LOAD]] to <16 x i8>			; CHECK-NEXT: [[TMP6:%.*]] = trunc <16 x i16> [[WIDE_LOAD]] to <16 x i8>
	; CHECK-NEXT: [[TMP9:%.*]] = zext <16 x i8> [[TMP8]] to <16 x i32>			; CHECK-NEXT: [[TMP7:%.*]] = shl <16 x i8> [[TMP6]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>
	; CHECK-NEXT: [[TMP10:%.*]] = trunc <16 x i32> [[TMP9]] to <16 x i8>			; CHECK-NEXT: [[TMP8:%.*]] = add <16 x i8> [[TMP7]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
	; CHECK-NEXT: [[TMP11:%.*]] = shl <16 x i8> [[TMP10]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>			; CHECK-NEXT: [[TMP9:%.*]] = and <16 x i8> [[TMP6]], <i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52>
	; CHECK-NEXT: [[TMP12:%.*]] = zext <16 x i8> [[TMP11]] to <16 x i32>			; CHECK-NEXT: [[TMP10:%.*]] = or <16 x i8> [[TMP9]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>
	; CHECK-NEXT: [[TMP13:%.*]] = trunc <16 x i32> [[TMP12]] to <16 x i8>			; CHECK-NEXT: [[TMP11:%.*]] = mul <16 x i8> [[TMP10]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>
	; CHECK-NEXT: [[TMP14:%.*]] = add <16 x i8> [[TMP13]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>			; CHECK-NEXT: [[TMP12:%.*]] = and <16 x i8> [[TMP8]], [[TMP1]]
	; CHECK-NEXT: [[TMP15:%.*]] = zext <16 x i8> [[TMP14]] to <16 x i32>			; CHECK-NEXT: [[TMP13:%.*]] = and <16 x i8> [[TMP11]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP16:%.*]] = and <16 x i8> [[TMP8]], <i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52>			; CHECK-NEXT: [[TMP14:%.*]] = xor <16 x i8> [[TMP13]], [[TMP2]]
	; CHECK-NEXT: [[TMP17:%.*]] = zext <16 x i8> [[TMP16]] to <16 x i32>			; CHECK-NEXT: [[TMP15:%.*]] = mul <16 x i8> [[TMP14]], [[TMP12]]
	; CHECK-NEXT: [[TMP18:%.*]] = trunc <16 x i32> [[TMP17]] to <16 x i8>			; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP19:%.*]] = or <16 x i8> [[TMP18]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>			; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP16]], i32 0
	; CHECK-NEXT: [[TMP20:%.*]] = zext <16 x i8> [[TMP19]] to <16 x i32>			; CHECK-NEXT: store <16 x i8> [[TMP15]], ptr [[TMP17]], align 1
	; CHECK-NEXT: [[TMP21:%.*]] = trunc <16 x i32> [[TMP20]] to <16 x i8>
	; CHECK-NEXT: [[TMP22:%.*]] = mul <16 x i8> [[TMP21]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>
	; CHECK-NEXT: [[TMP23:%.*]] = zext <16 x i8> [[TMP22]] to <16 x i32>
	; CHECK-NEXT: [[TMP24:%.*]] = trunc <16 x i32> [[TMP15]] to <16 x i8>
	; CHECK-NEXT: [[TMP25:%.*]] = trunc <16 x i32> [[TMP2]] to <16 x i8>
	; CHECK-NEXT: [[TMP26:%.*]] = and <16 x i8> [[TMP24]], [[TMP25]]
	; CHECK-NEXT: [[TMP27:%.*]] = zext <16 x i8> [[TMP26]] to <16 x i32>
	; CHECK-NEXT: [[TMP28:%.*]] = trunc <16 x i32> [[TMP23]] to <16 x i8>
	; CHECK-NEXT: [[TMP29:%.*]] = and <16 x i8> [[TMP28]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP30:%.*]] = zext <16 x i8> [[TMP29]] to <16 x i32>
	; CHECK-NEXT: [[TMP31:%.*]] = trunc <16 x i32> [[TMP30]] to <16 x i8>
	; CHECK-NEXT: [[TMP32:%.*]] = trunc <16 x i32> [[TMP4]] to <16 x i8>
	; CHECK-NEXT: [[TMP33:%.*]] = xor <16 x i8> [[TMP31]], [[TMP32]]
	; CHECK-NEXT: [[TMP34:%.*]] = zext <16 x i8> [[TMP33]] to <16 x i32>
	; CHECK-NEXT: [[TMP35:%.*]] = trunc <16 x i32> [[TMP34]] to <16 x i8>
	; CHECK-NEXT: [[TMP36:%.*]] = trunc <16 x i32> [[TMP27]] to <16 x i8>
	; CHECK-NEXT: [[TMP37:%.*]] = mul <16 x i8> [[TMP35]], [[TMP36]]
	; CHECK-NEXT: [[TMP38:%.*]] = zext <16 x i8> [[TMP37]] to <16 x i32>
	; CHECK-NEXT: [[TMP39:%.*]] = trunc <16 x i32> [[TMP38]] to <16 x i8>
	; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP5]]
	; CHECK-NEXT: [[TMP41:%.*]] = getelementptr inbounds i8, ptr [[TMP40]], i32 0
	; CHECK-NEXT: store <16 x i8> [[TMP39]], ptr [[TMP41]], align 1
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP42:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP42]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
	; CHECK: vec.epilog.iter.check:			; CHECK: vec.epilog.iter.check:
	; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]			; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
	; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8			; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
	; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]			; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
	; CHECK: vec.epilog.ph:			; CHECK: vec.epilog.ph:
	; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]			; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
	; CHECK-NEXT: [[N_MOD_VF4:%.*]] = urem i64 [[TMP0]], 8			; CHECK-NEXT: [[N_MOD_VF4:%.*]] = urem i64 [[TMP0]], 8
	; CHECK-NEXT: [[N_VEC5:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF4]]			; CHECK-NEXT: [[N_VEC5:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF4]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT9:%.*]] = insertelement <8 x i32> poison, i32 [[CONV13]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <8 x i32> poison, i32 [[CONV13]], i64 0
	; CHECK-NEXT: [[TMP43:%.*]] = trunc <8 x i32> [[BROADCAST_SPLATINSERT9]] to <8 x i8>			; CHECK-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <8 x i32> [[BROADCAST_SPLATINSERT7]], <8 x i32> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[BROADCAST_SPLAT10:%.*]] = shufflevector <8 x i8> [[TMP43]], <8 x i8> poison, <8 x i32> zeroinitializer			; CHECK-NEXT: [[TMP19:%.*]] = trunc <8 x i32> [[BROADCAST_SPLAT8]] to <8 x i8>
	; CHECK-NEXT: [[TMP44:%.*]] = zext <8 x i8> [[BROADCAST_SPLAT10]] to <8 x i32>			; CHECK-NEXT: [[BROADCAST_SPLATINSERT9:%.*]] = insertelement <8 x i32> poison, i32 [[CONV11]], i64 0
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT11:%.*]] = insertelement <8 x i32> poison, i32 [[CONV11]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLAT10:%.*]] = shufflevector <8 x i32> [[BROADCAST_SPLATINSERT9]], <8 x i32> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP45:%.*]] = trunc <8 x i32> [[BROADCAST_SPLATINSERT11]] to <8 x i8>			; CHECK-NEXT: [[TMP20:%.*]] = trunc <8 x i32> [[BROADCAST_SPLAT10]] to <8 x i8>
	; CHECK-NEXT: [[BROADCAST_SPLAT12:%.*]] = shufflevector <8 x i8> [[TMP45]], <8 x i8> poison, <8 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP46:%.*]] = zext <8 x i8> [[BROADCAST_SPLAT12]] to <8 x i32>
	; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
	; CHECK: vec.epilog.vector.body:			; CHECK: vec.epilog.vector.body:
	; CHECK-NEXT: [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT13:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX11:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT13:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP47:%.*]] = add i64 [[INDEX7]], 0			; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX11]], 0
	; CHECK-NEXT: [[TMP48:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP47]]			; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[TMP21]]
	; CHECK-NEXT: [[TMP49:%.*]] = getelementptr inbounds i16, ptr [[TMP48]], i32 0			; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i16, ptr [[TMP22]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i16>, ptr [[TMP49]], align 2			; CHECK-NEXT: [[WIDE_LOAD12:%.*]] = load <8 x i16>, ptr [[TMP23]], align 2
	; CHECK-NEXT: [[TMP50:%.*]] = trunc <8 x i16> [[WIDE_LOAD8]] to <8 x i8>			; CHECK-NEXT: [[TMP24:%.*]] = trunc <8 x i16> [[WIDE_LOAD12]] to <8 x i8>
	; CHECK-NEXT: [[TMP51:%.*]] = zext <8 x i8> [[TMP50]] to <8 x i32>			; CHECK-NEXT: [[TMP25:%.*]] = shl <8 x i8> [[TMP24]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>
	; CHECK-NEXT: [[TMP52:%.*]] = trunc <8 x i32> [[TMP51]] to <8 x i8>			; CHECK-NEXT: [[TMP26:%.*]] = add <8 x i8> [[TMP25]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
	; CHECK-NEXT: [[TMP53:%.*]] = shl <8 x i8> [[TMP52]], <i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4, i8 4>			; CHECK-NEXT: [[TMP27:%.*]] = and <8 x i8> [[TMP24]], <i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52>
	; CHECK-NEXT: [[TMP54:%.*]] = zext <8 x i8> [[TMP53]] to <8 x i32>			; CHECK-NEXT: [[TMP28:%.*]] = or <8 x i8> [[TMP27]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>
	; CHECK-NEXT: [[TMP55:%.*]] = trunc <8 x i32> [[TMP54]] to <8 x i8>			; CHECK-NEXT: [[TMP29:%.*]] = mul <8 x i8> [[TMP28]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>
	; CHECK-NEXT: [[TMP56:%.*]] = add <8 x i8> [[TMP55]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>			; CHECK-NEXT: [[TMP30:%.*]] = and <8 x i8> [[TMP26]], [[TMP19]]
	; CHECK-NEXT: [[TMP57:%.*]] = zext <8 x i8> [[TMP56]] to <8 x i32>			; CHECK-NEXT: [[TMP31:%.*]] = and <8 x i8> [[TMP29]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP58:%.*]] = and <8 x i8> [[TMP50]], <i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52, i8 -52>			; CHECK-NEXT: [[TMP32:%.*]] = xor <8 x i8> [[TMP31]], [[TMP20]]
	; CHECK-NEXT: [[TMP59:%.*]] = zext <8 x i8> [[TMP58]] to <8 x i32>			; CHECK-NEXT: [[TMP33:%.*]] = mul <8 x i8> [[TMP32]], [[TMP30]]
	; CHECK-NEXT: [[TMP60:%.*]] = trunc <8 x i32> [[TMP59]] to <8 x i8>			; CHECK-NEXT: [[TMP34:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP21]]
	; CHECK-NEXT: [[TMP61:%.*]] = or <8 x i8> [[TMP60]], <i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51, i8 51>			; CHECK-NEXT: [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[TMP34]], i32 0
	; CHECK-NEXT: [[TMP62:%.*]] = zext <8 x i8> [[TMP61]] to <8 x i32>			; CHECK-NEXT: store <8 x i8> [[TMP33]], ptr [[TMP35]], align 1
	; CHECK-NEXT: [[TMP63:%.*]] = trunc <8 x i32> [[TMP62]] to <8 x i8>			; CHECK-NEXT: [[INDEX_NEXT13]] = add nuw i64 [[INDEX11]], 8
	; CHECK-NEXT: [[TMP64:%.*]] = mul <8 x i8> [[TMP63]], <i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60, i8 60>			; CHECK-NEXT: [[TMP36:%.*]] = icmp eq i64 [[INDEX_NEXT13]], [[N_VEC5]]
	; CHECK-NEXT: [[TMP65:%.*]] = zext <8 x i8> [[TMP64]] to <8 x i32>			; CHECK-NEXT: br i1 [[TMP36]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
	; CHECK-NEXT: [[TMP66:%.*]] = trunc <8 x i32> [[TMP57]] to <8 x i8>
	; CHECK-NEXT: [[TMP67:%.*]] = trunc <8 x i32> [[TMP44]] to <8 x i8>
	; CHECK-NEXT: [[TMP68:%.*]] = and <8 x i8> [[TMP66]], [[TMP67]]
	; CHECK-NEXT: [[TMP69:%.*]] = zext <8 x i8> [[TMP68]] to <8 x i32>
	; CHECK-NEXT: [[TMP70:%.*]] = trunc <8 x i32> [[TMP65]] to <8 x i8>
	; CHECK-NEXT: [[TMP71:%.*]] = and <8 x i8> [[TMP70]], <i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4, i8 -4>
	; CHECK-NEXT: [[TMP72:%.*]] = zext <8 x i8> [[TMP71]] to <8 x i32>
	; CHECK-NEXT: [[TMP73:%.*]] = trunc <8 x i32> [[TMP72]] to <8 x i8>
	; CHECK-NEXT: [[TMP74:%.*]] = trunc <8 x i32> [[TMP46]] to <8 x i8>
	; CHECK-NEXT: [[TMP75:%.*]] = xor <8 x i8> [[TMP73]], [[TMP74]]
	; CHECK-NEXT: [[TMP76:%.*]] = zext <8 x i8> [[TMP75]] to <8 x i32>
	; CHECK-NEXT: [[TMP77:%.*]] = trunc <8 x i32> [[TMP76]] to <8 x i8>
	; CHECK-NEXT: [[TMP78:%.*]] = trunc <8 x i32> [[TMP69]] to <8 x i8>
	; CHECK-NEXT: [[TMP79:%.*]] = mul <8 x i8> [[TMP77]], [[TMP78]]
	; CHECK-NEXT: [[TMP80:%.*]] = zext <8 x i8> [[TMP79]] to <8 x i32>
	; CHECK-NEXT: [[TMP81:%.*]] = trunc <8 x i32> [[TMP80]] to <8 x i8>
	; CHECK-NEXT: [[TMP82:%.*]] = getelementptr inbounds i8, ptr [[Q]], i64 [[TMP47]]
	; CHECK-NEXT: [[TMP83:%.*]] = getelementptr inbounds i8, ptr [[TMP82]], i32 0
	; CHECK-NEXT: store <8 x i8> [[TMP81]], ptr [[TMP83]], align 1
	; CHECK-NEXT: [[INDEX_NEXT13]] = add nuw i64 [[INDEX7]], 8
	; CHECK-NEXT: [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT13]], [[N_VEC5]]
	; CHECK-NEXT: br i1 [[TMP84]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
	; CHECK: vec.epilog.middle.block:			; CHECK: vec.epilog.middle.block:
	; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC5]]			; CHECK-NEXT: [[CMP_N6:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC5]]
	; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N6]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[VEC_EPILOG_SCALAR_PH]]
	; CHECK: vec.epilog.scalar.ph:			; CHECK: vec.epilog.scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup.loopexit:			; CHECK: for.cond.cleanup.loopexit:
	; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]			; CHECK-NEXT: br label [[FOR_COND_CLEANUP]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[P]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP85:%.*]] = load i16, ptr [[ARRAYIDX]], align 2			; CHECK-NEXT: [[TMP37:%.*]] = load i16, ptr [[ARRAYIDX]], align 2
	; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP85]] to i32			; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP37]] to i32
	; CHECK-NEXT: [[ADD:%.*]] = shl i32 [[CONV]], 4			; CHECK-NEXT: [[ADD:%.*]] = shl i32 [[CONV]], 4
	; CHECK-NEXT: [[CONV2:%.*]] = add nsw i32 [[ADD]], 32			; CHECK-NEXT: [[CONV2:%.*]] = add nsw i32 [[ADD]], 32
	; CHECK-NEXT: [[OR:%.*]] = and i32 [[CONV]], 204			; CHECK-NEXT: [[OR:%.*]] = and i32 [[CONV]], 204
	; CHECK-NEXT: [[CONV8:%.*]] = or i32 [[OR]], 51			; CHECK-NEXT: [[CONV8:%.*]] = or i32 [[OR]], 51
	; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV8]], 60			; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV8]], 60
	; CHECK-NEXT: [[AND:%.*]] = and i32 [[CONV2]], [[CONV13]]			; CHECK-NEXT: [[AND:%.*]] = and i32 [[CONV2]], [[CONV13]]
	; CHECK-NEXT: [[MUL_MASKED:%.*]] = and i32 [[MUL]], 252			; CHECK-NEXT: [[MUL_MASKED:%.*]] = and i32 [[MUL]], 252
	; CHECK-NEXT: [[CONV17:%.*]] = xor i32 [[MUL_MASKED]], [[CONV11]]			; CHECK-NEXT: [[CONV17:%.*]] = xor i32 [[MUL_MASKED]], [[CONV11]]

llvm/test/Transforms/LoopVectorize/AArch64/type-shrinkage-insertelt.ll

	; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 1			; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 1
	; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 2			; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 2
	; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 3			; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 3
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[TMP0]]			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[TMP4]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[TMP4]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP5]], align 2			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP5]], align 2
	; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i16> [[WIDE_LOAD]], <i16 10, i16 10, i16 10, i16 10>			; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i16> [[WIDE_LOAD]], <i16 10, i16 10, i16 10, i16 10>
	; CHECK-NEXT: [[TMP7:%.*]] = zext <4 x i16> [[TMP6]] to <4 x i32>			; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP8:%.*]] = trunc <4 x i32> [[TMP7]] to <4 x i16>			; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]			; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP11:%.*]] = load i64, ptr [[TMP7]], align 8
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP3]]			; CHECK-NEXT: [[TMP12:%.*]] = load i64, ptr [[TMP8]], align 8
	; CHECK-NEXT: [[TMP13:%.*]] = load i64, ptr [[TMP9]], align 8			; CHECK-NEXT: [[TMP13:%.*]] = load i64, ptr [[TMP9]], align 8
	; CHECK-NEXT: [[TMP14:%.*]] = load i64, ptr [[TMP10]], align 8			; CHECK-NEXT: [[TMP14:%.*]] = load i64, ptr [[TMP10]], align 8
	; CHECK-NEXT: [[TMP15:%.*]] = load i64, ptr [[TMP11]], align 8			; CHECK-NEXT: [[TMP15:%.*]] = ashr exact i64 [[TMP11]], 32
	; CHECK-NEXT: [[TMP16:%.*]] = load i64, ptr [[TMP12]], align 8			; CHECK-NEXT: [[TMP16:%.*]] = ashr exact i64 [[TMP12]], 32
	; CHECK-NEXT: [[TMP17:%.*]] = ashr exact i64 [[TMP13]], 32			; CHECK-NEXT: [[TMP17:%.*]] = ashr exact i64 [[TMP13]], 32
	; CHECK-NEXT: [[TMP18:%.*]] = ashr exact i64 [[TMP14]], 32			; CHECK-NEXT: [[TMP18:%.*]] = ashr exact i64 [[TMP14]], 32
	; CHECK-NEXT: [[TMP19:%.*]] = ashr exact i64 [[TMP15]], 32			; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP15]]
	; CHECK-NEXT: [[TMP20:%.*]] = ashr exact i64 [[TMP16]], 32			; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP16]]
	; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP17]]			; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP17]]
	; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP18]]			; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP18]]
	; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP23:%.*]] = extractelement <4 x i16> [[TMP6]], i32 0
	; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP20]]			; CHECK-NEXT: store i16 [[TMP23]], ptr [[TMP19]], align 2
	; CHECK-NEXT: [[TMP25:%.*]] = extractelement <4 x i16> [[TMP8]], i32 0			; CHECK-NEXT: [[TMP24:%.*]] = extractelement <4 x i16> [[TMP6]], i32 1
				; CHECK-NEXT: store i16 [[TMP24]], ptr [[TMP20]], align 2
				; CHECK-NEXT: [[TMP25:%.*]] = extractelement <4 x i16> [[TMP6]], i32 2
	; CHECK-NEXT: store i16 [[TMP25]], ptr [[TMP21]], align 2			; CHECK-NEXT: store i16 [[TMP25]], ptr [[TMP21]], align 2
	; CHECK-NEXT: [[TMP26:%.*]] = extractelement <4 x i16> [[TMP8]], i32 1			; CHECK-NEXT: [[TMP26:%.*]] = extractelement <4 x i16> [[TMP6]], i32 3
	; CHECK-NEXT: store i16 [[TMP26]], ptr [[TMP22]], align 2			; CHECK-NEXT: store i16 [[TMP26]], ptr [[TMP22]], align 2
	; CHECK-NEXT: [[TMP27:%.*]] = extractelement <4 x i16> [[TMP8]], i32 2
	; CHECK-NEXT: store i16 [[TMP27]], ptr [[TMP23]], align 2
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <4 x i16> [[TMP8]], i32 3
	; CHECK-NEXT: store i16 [[TMP28]], ptr [[TMP24]], align 2
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP29:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16			; CHECK-NEXT: [[TMP27:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
	; CHECK-NEXT: br i1 [[TMP29]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP27]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_INC1286_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_INC1286_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[IF_THEN1165_US:%.*]]			; CHECK-NEXT: br label [[IF_THEN1165_US:%.*]]
	; CHECK: if.then1165.us:			; CHECK: if.then1165.us:
	; CHECK-NEXT: [[INDVARS_IV1783:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT1784:%.*]], [[IF_THEN1165_US]] ]			; CHECK-NEXT: [[INDVARS_IV1783:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT1784:%.*]], [[IF_THEN1165_US]] ]
	; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[INDVARS_IV1783]]			; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[INDVARS_IV1783]]
	[51 lines not shown]
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[C]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[C]], align 4
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[TMP0]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[TMP5]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[TMP5]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP6]], align 2			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP6]], align 2
	; CHECK-NEXT: [[TMP7:%.*]] = trunc <4 x i32> [[BROADCAST_SPLAT]] to <4 x i16>			; CHECK-NEXT: [[TMP7:%.*]] = trunc <4 x i32> [[BROADCAST_SPLAT]] to <4 x i16>
	; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i16> [[WIDE_LOAD]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i16> [[WIDE_LOAD]], [[TMP7]]
	; CHECK-NEXT: [[TMP9:%.*]] = zext <4 x i16> [[TMP8]] to <4 x i32>			; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP10:%.*]] = trunc <4 x i32> [[TMP9]] to <4 x i16>			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP3]]
	; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP13:%.*]] = load i64, ptr [[TMP9]], align 8
	; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP3]]			; CHECK-NEXT: [[TMP14:%.*]] = load i64, ptr [[TMP10]], align 8
	; CHECK-NEXT: [[TMP15:%.*]] = load i64, ptr [[TMP11]], align 8			; CHECK-NEXT: [[TMP15:%.*]] = load i64, ptr [[TMP11]], align 8
	; CHECK-NEXT: [[TMP16:%.*]] = load i64, ptr [[TMP12]], align 8			; CHECK-NEXT: [[TMP16:%.*]] = load i64, ptr [[TMP12]], align 8
	; CHECK-NEXT: [[TMP17:%.*]] = load i64, ptr [[TMP13]], align 8			; CHECK-NEXT: [[TMP17:%.*]] = ashr exact i64 [[TMP13]], 32
	; CHECK-NEXT: [[TMP18:%.*]] = load i64, ptr [[TMP14]], align 8			; CHECK-NEXT: [[TMP18:%.*]] = ashr exact i64 [[TMP14]], 32
	; CHECK-NEXT: [[TMP19:%.*]] = ashr exact i64 [[TMP15]], 32			; CHECK-NEXT: [[TMP19:%.*]] = ashr exact i64 [[TMP15]], 32
	; CHECK-NEXT: [[TMP20:%.*]] = ashr exact i64 [[TMP16]], 32			; CHECK-NEXT: [[TMP20:%.*]] = ashr exact i64 [[TMP16]], 32
	; CHECK-NEXT: [[TMP21:%.*]] = ashr exact i64 [[TMP17]], 32			; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP17]]
	; CHECK-NEXT: [[TMP22:%.*]] = ashr exact i64 [[TMP18]], 32			; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP18]]
	; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP19]]
	; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP20]]			; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP20]]
	; CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP21]]			; CHECK-NEXT: [[TMP25:%.*]] = extractelement <4 x i16> [[TMP8]], i32 0
	; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i16, ptr [[M3]], i64 [[TMP22]]			; CHECK-NEXT: store i16 [[TMP25]], ptr [[TMP21]], align 2
	; CHECK-NEXT: [[TMP27:%.*]] = extractelement <4 x i16> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP26:%.*]] = extractelement <4 x i16> [[TMP8]], i32 1
				; CHECK-NEXT: store i16 [[TMP26]], ptr [[TMP22]], align 2
				; CHECK-NEXT: [[TMP27:%.*]] = extractelement <4 x i16> [[TMP8]], i32 2
	; CHECK-NEXT: store i16 [[TMP27]], ptr [[TMP23]], align 2			; CHECK-NEXT: store i16 [[TMP27]], ptr [[TMP23]], align 2
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <4 x i16> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP28:%.*]] = extractelement <4 x i16> [[TMP8]], i32 3
	; CHECK-NEXT: store i16 [[TMP28]], ptr [[TMP24]], align 2			; CHECK-NEXT: store i16 [[TMP28]], ptr [[TMP24]], align 2
	; CHECK-NEXT: [[TMP29:%.*]] = extractelement <4 x i16> [[TMP10]], i32 2
	; CHECK-NEXT: store i16 [[TMP29]], ptr [[TMP25]], align 2
	; CHECK-NEXT: [[TMP30:%.*]] = extractelement <4 x i16> [[TMP10]], i32 3
	; CHECK-NEXT: store i16 [[TMP30]], ptr [[TMP26]], align 2
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
	; CHECK-NEXT: [[TMP31:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16			; CHECK-NEXT: [[TMP29:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
	; CHECK-NEXT: br i1 [[TMP31]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP29]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_INC1286_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_INC1286_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 16, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[IF_THEN1165_US:%.*]]			; CHECK-NEXT: br label [[IF_THEN1165_US:%.*]]
	; CHECK: if.then1165.us:			; CHECK: if.then1165.us:
	; CHECK-NEXT: [[INDVARS_IV1783:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT1784:%.*]], [[IF_THEN1165_US]] ]			; CHECK-NEXT: [[INDVARS_IV1783:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT1784:%.*]], [[IF_THEN1165_US]] ]
	; CHECK-NEXT: [[FPTR:%.*]] = load i32, ptr [[C]], align 4			; CHECK-NEXT: [[FPTR:%.*]] = load i32, ptr [[C]], align 4

llvm/test/Transforms/LoopVectorize/scalable-trunc-min-bitwidth.ll

	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4			; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[LEN]], [[TMP3]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[LEN]], [[TMP3]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[LEN]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[LEN]], [[N_MOD_VF]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[ARG1:%.*]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[ARG1:%.*]], i64 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP4:%.*]] = trunc <vscale x 4 x i32> [[BROADCAST_SPLAT]] to <vscale x 4 x i8>
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[P:%.*]], i64 [[INDEX]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[P:%.*]], i64 [[INDEX]]
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP4]], align 1			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i8>, ptr [[TMP5]], align 1
	; CHECK-NEXT: [[TMP5:%.*]] = trunc <vscale x 4 x i32> [[BROADCAST_SPLAT]] to <vscale x 4 x i8>			; CHECK-NEXT: [[TMP6:%.*]] = xor <vscale x 4 x i8> [[WIDE_LOAD]], [[TMP4]]
	; CHECK-NEXT: [[TMP6:%.*]] = xor <vscale x 4 x i8> [[WIDE_LOAD]], [[TMP5]]
	; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i8> [[TMP6]], [[WIDE_LOAD]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i8> [[TMP6]], [[WIDE_LOAD]]
	; CHECK-NEXT: store <vscale x 4 x i8> [[TMP7]], ptr [[TMP4]], align 1			; CHECK-NEXT: store <vscale x 4 x i8> [[TMP7]], ptr [[TMP5]], align 1
	; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 4			; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 4
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP9]]			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP9]]
	; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[LEN]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[LEN]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_EXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_EXIT:%.*]], label [[SCALAR_PH]]

llvm/test/Transforms/LoopVectorize/trunc-shifts.ll

	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i8			; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i8
	; CHECK-NEXT: [[TMP0:%.*]] = add i8 [[OFFSET_IDX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i8 [[OFFSET_IDX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = zext i8 [[TMP0]] to i64			; CHECK-NEXT: [[TMP1:%.*]] = zext i8 [[TMP0]] to i64
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP3]], align 1			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP3]], align 1
	; CHECK-NEXT: [[TMP4:%.*]] = zext <4 x i8> [[WIDE_LOAD]] to <4 x i16>			; CHECK-NEXT: [[TMP4:%.*]] = zext <4 x i8> [[WIDE_LOAD]] to <4 x i16>
	; CHECK-NEXT: [[TMP5:%.*]] = zext <4 x i16> [[TMP4]] to <4 x i32>			; CHECK-NEXT: [[TMP5:%.*]] = lshr <4 x i16> [[TMP4]], <i16 4, i16 4, i16 4, i16 4>
	; CHECK-NEXT: [[TMP6:%.*]] = trunc <4 x i32> [[TMP5]] to <4 x i16>			; CHECK-NEXT: [[TMP6:%.*]] = trunc <4 x i16> [[TMP5]] to <4 x i8>
	; CHECK-NEXT: [[TMP7:%.*]] = lshr <4 x i16> [[TMP6]], <i16 4, i16 4, i16 4, i16 4>			; CHECK-NEXT: store <4 x i8> [[TMP6]], ptr [[TMP3]], align 8
	; CHECK-NEXT: [[TMP8:%.*]] = zext <4 x i16> [[TMP7]] to <4 x i32>
	; CHECK-NEXT: [[TMP9:%.*]] = trunc <4 x i32> [[TMP8]] to <4 x i16>
	; CHECK-NEXT: [[TMP10:%.*]] = trunc <4 x i16> [[TMP9]] to <4 x i8>
	; CHECK-NEXT: store <4 x i8> [[TMP10]], ptr [[TMP3]], align 8
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
				Ayal: We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it. Missed MinBW opportunity?
				fhahn: trunc/ext pairs should be better cleaned up in the latest version.
				Ayal: Indeed!
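				The redundancy discussed in this thread can be checked with a small scalar model. The sketch below is not part of the patch: the function names are hypothetical, and each function models a single lane of the `<4 x i16>` values in the CHECK lines, assuming unsigned 16-bit lanes as produced by the `zext` from `i8`. It compares the old sequence (zext to i32, trunc back to i16, then `lshr` by 4) against the new one (`lshr` directly on i16).

```cpp
#include <cstdint>

// Hypothetical helper: models one lane of the old output, where the value is
// zero-extended to 32 bits and truncated back to 16 bits before the shift.
std::uint16_t shiftWithRoundTrip(std::uint16_t lane) {
  std::uint32_t widened = lane;                                  // zext i16 -> i32
  std::uint16_t narrowed = static_cast<std::uint16_t>(widened);  // trunc i32 -> i16
  return static_cast<std::uint16_t>(narrowed >> 4);              // lshr i16, 4
}

// Hypothetical helper: models one lane of the new output, where the shift is
// applied to the 16-bit value directly.
std::uint16_t shiftDirect(std::uint16_t lane) {
  return static_cast<std::uint16_t>(lane >> 4);                  // lshr i16, 4
}

// Exhaustively compares both sequences over every 16-bit input.
bool roundTripIsRedundant() {
  for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
    auto lane = static_cast<std::uint16_t>(v);
    if (shiftWithRoundTrip(lane) != shiftDirect(lane))
      return false;
  }
  return true;
}
```

				Because the zext/trunc pair returns the original 16-bit value, dropping it changes the emitted IR but not the computed result, which matches the observation that the patch is not NFC yet keeps the same semantics for this test.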
	; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100			; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
	; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i8 [ 100, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i8 [ 100, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[IV:%.*]] = phi i8 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV:%.*]] = phi i8 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[IV_EXT:%.*]] = zext i8 [[IV]] to i64			; CHECK-NEXT: [[IV_EXT:%.*]] = zext i8 [[IV]] to i64

Revision Contents

Diff 558186