
[LoopVectorizer] Inloop vector reductions

Authored by dmgreen on Feb 24 2020, 11:17 AM.



Arm MVE has multiple instructions such as VMLAVA.s8, which (in this case) can take two 128-bit vectors, sign-extend the inputs to i32, multiply them together and sum the result into a 32-bit general purpose register. So taking 16 i8s as inputs, they can multiply and accumulate the result into a single i32 without any rounding/truncating along the way. There are also reduction instructions for plain integer add and min/max, and operations that sum into a pair of 32-bit registers treated together as a 64-bit integer (even though MVE does not have a plain 64-bit addition instruction). So giving the vectorizer the ability to use these instructions both enables us to vectorize at higher bitwidths, and to vectorize things we previously could not.
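As a concrete illustration (a hand-written sketch, not code from the patch; the function name is made up), the kind of source loop that maps onto a single VMLAVA.s8 is an i8 dot product accumulated into an i32:

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical example: an i8 dot product accumulated into an i32 without any
// intermediate truncation. With in-loop reductions the vectorizer can turn the
// whole loop body into one multiply-accumulate-across-vector instruction.
int32_t dot_i8(const int8_t *a, const int8_t *b, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i)
    sum += (int32_t)a[i] * (int32_t)b[i]; // sign-extend, multiply, accumulate
  return sum;
}
```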

In order to do that we need a way to represent that the "reduction" operation, specified with an llvm.experimental.vector.reduce intrinsic when vectorizing for Arm, occurs inside the loop, not after it like most reductions. This patch attempts to do that, teaching the vectorizer about in-loop reductions. It does this through a vplan recipe representing the reduction, which replaces the original chain of reduction operations. Cost modelling is currently just done through a "prefersInloopReduction" TTI hook (which follows in a later patch).
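To illustrate the difference (a scalar sketch, not code from the patch; VF, the function names, and the assumption that n is a multiple of VF are all mine), here is the same add reduction with the horizontal step placed after the loop versus inside it:

```cpp
#include <array>
#include <numeric>
#include <cassert>

constexpr int VF = 4; // assumed vectorization factor; n must be a multiple

// Out-of-loop (the usual form): keep a vector of partial sums and do one
// horizontal reduction after the loop.
int reduceOutOfLoop(const int *data, int n) {
  std::array<int, VF> partial{};
  for (int i = 0; i < n; i += VF)
    for (int l = 0; l < VF; ++l)
      partial[l] += data[i + l];
  // Horizontal reduction happens once, after the loop.
  return std::accumulate(partial.begin(), partial.end(), 0);
}

// In-loop (what this patch enables): reduce each vector to a scalar inside
// the loop, which VADDV-style instructions make cheap on MVE.
int reduceInLoop(const int *data, int n) {
  int acc = 0; // scalar accumulator live across iterations
  for (int i = 0; i < n; i += VF) {
    int lanes = 0; // horizontal add of this vector, inside the loop
    for (int l = 0; l < VF; ++l)
      lanes += data[i + l];
    acc += lanes;
  }
  return acc;
}
```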

Diff Detail

Event Timeline

dmgreen marked 36 inline comments as done.May 20 2020, 4:09 PM
dmgreen edited the summary of this revision. (Show Details)

Round one. Lots of renaming, plenty of cleanup. I've tried to add Min/Max handling too, and added/cleaned up some more tests. Hopefully addressed most of the comments.

dmgreen added inline comments.May 20 2020, 4:45 PM

Yeah, that direction makes it a lot simpler. Thanks.


I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).

MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are never slow enough to make the in-loop reduction unprofitable (and they have other advantages, like handling higher type sizes and folding in more instructions).

My point is that min/max, like some of the other fadd/mul/and/etc, might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if they are not used straight away.


Right. We look from the phi down (now), so we can't go through the 'and' that we would need in order to detect the different bitwidth. That would leave the 'and' reduction, which I think would not go through that codepath to create type-promoted phis?

I've put an explicit check in for it to be sure, and added some testing for those cases.


This does look a little strange here on its own. The followup patch to add the TTI hook makes it look like:

if (!PreferInloopReductions &&
      !TTI.preferInloopReduction(Opcode, Phi->getType(),

Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?

It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and makes it clearer what is going on.


I've changed it to be stored there. It does mean multiple things are holding TTI. Let me know what you think.


Ah. Old code no longer needed.


This testcase is new for this patch (although that wasn't obvious from how I presented it). I was just trying to show the diffs here, and the option didn't exist before this patch, hence it wasn't already there.

Are you suggesting we should still have these same tests without -force-inloop-reductions? I think those should already be tested elsewhere, but I can add them if you think it's useful.


(unnecessary truncs stripped)?

The const case is going to look a little odd until there is an instcombine-type fold for constant reductions. I'll see about adding that.

Otherwise it is now hopefully a bit leaner, even if I have added a few extra tests.

Ayal added a comment.Jun 5 2020, 3:38 PM

The proposed RecurrenceDescriptor::getReductionOpChain() method identifies horizontal reductions, a subset of the vertical reductions identified by RecurrenceDescriptor::isReductionPHI(). Being two totally independent methods, it's unclear what the latter supports that the former does not. Would it be better to have isReductionPHI() also take care of recognizing horizontal reductions, and record them as in CastsToIgnore? See also TODO commented inline.


Perhaps the above "better way" would also help recognize and record horizontal reductions?


|| !Cur->hasNUses(ExpectedUses) ?

nit: can alternatively let getNextInstruction check its result and return only valid ones, e.g.:

bool RedOpIsCmp = (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp);
unsigned ExpectedUses = RedOpIsCmp ? 2 : 1;

auto getNextInstruction = [&](Instruction *Cur) -> Instruction * {
  if (!Cur->hasNUses(ExpectedUses))
    return nullptr;
  auto *FirstUser = cast<Instruction>(*Cur->user_begin());
  if (!RedOpIsCmp)
    return FirstUser->getOpcode() == RedOp ? FirstUser : nullptr;
  // Handle the cmp/select pair of a min/max reduction:
  auto *Sel = dyn_cast<SelectInst>(FirstUser);
  Value *LHS, *RHS;
  if (Sel && SelectPatternResult::isMinOrMax(
                 matchSelectPattern(Sel, LHS, RHS).Flavor))
    return Sel;
  return nullptr;
};

for (auto *Cur = getNextInstruction(Phi); Cur && Cur != LoopExitInstr;
     Cur = getNextInstruction(Cur))

... along with their associated chains of reduction operations,
in program order from top (PHI) to bottom?


Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?


'and' reduction may or may not be performed in a smaller type, if I'm not mistaken.


Check in-loop reduction along with VF == 1, instead of modifying VF, as in other places? E.g.,

bool ScalarPHI = (VF == 1) || Cost->isInLoopReduction(cast<PHINode>(PN));

Type *VecTy =
    ScalarPHI ? PN->getType() : VectorType::get(PN->getType(), VF);



The "else" clause will be empty in non-debug mode.
Can hoist the LLVM_DEBUG's, e.g.,

bool InLoop = !ReductionOperations.empty();
LLVM_DEBUG(dbgs() << "LV: Using " << (InLoop ? "inloop" : "out of loop") << " reduction for phi: " << *Phi << "\n");

Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().

(Better early-exit via "if (!CM.isInLoopReduction(Reduction.first)) continue;")


Add a comment that this is the compare operand of the select, whose recipe will also be discarded.


Consider outlining this part, unless it can be shortened.


Checking also if hasOuterloopReductions() is an independent optimization that should be done separately of this patch.


Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().


ChainOp can be simply set:

VPValue *ChainOp = Plan->getVPValue(Chain);

leaving only VecOp to figure out which operand index to use. Can preset the options before the loop:

RecurrenceDescriptor::RecurrenceKind Kind = RdxDesc.getRecurrenceKind();
bool IsMinMax = (Kind == RecurrenceDescriptor::RK_IntegerMinMax ||
                 Kind == RecurrenceDescriptor::RK_FloatMinMax);
unsigned FirstOpId = IsMinMax ? 1 : 0;
unsigned SecondOpId = FirstOpId + 1;

and then do

auto *FirstOp = R->getOperand(FirstOpId);
auto *SecondOp = R->getOperand(SecondOpId);
VPValue *VecOp = Plan->getVPValue(FirstOp == Chain ? SecondOp : FirstOp);

ok, understood; no need to retain non-existing tests.

dmgreen marked 15 inline comments as done.
dmgreen added inline comments.

Hmm. The reason I kept them separate was that this method is already pretty complex. I was trying to keep things simpler. Adding the ability to detect a single chain of operations from Phi to LoopExitValue that can be used for horizontal reductions looks... difficult. And error prone. :) If you think it's worth it then I can certainly give it a go! I like the separation of concerns in keeping them separate, though.

The extra things that AddReductionVar will detect that getReductionOpChain will not are:

  • Phi/select predicated reductions like in if-conversion-reductions.ll and if-reduction.ll. These would need some form of predicated reduction intrinsic.
  • Narrow bitwidths. This one I could add.
  • Subs/FSubs are treated like Adds/FAdds for vertical reductions.

This is the loop exit instr, so can have as many uses as it likes I believe.


-Os sounds like a good plan. It will take some backend work to make it efficient enough first, though. And predicated reductions?


The check for lookThroughAnd in AddReductionVar is guarded by isArithmeticRecurrenceKind, so at least currently it cannot be an And reduction.


I believe that InloopReductionChains would not iterate in a deterministic order, which is why I avoided it.

Perhaps that would not matter here? The reductions should be independent anyway. Seems safer to try and use deterministic ordering anyway if we can.


Ah sorry. I split it off but apparently didn't upstream the other part yet. Now in D81415.

dmgreen updated this revision to Diff 269300.Jun 8 2020, 11:38 AM
dmgreen marked 2 inline comments as done.


bmahjour removed a subscriber: bmahjour.Jun 22 2020, 6:47 AM
Ayal added inline comments.Jul 6 2020, 1:15 PM

Would be good to see if above TODO can be addressed - providing the set of all instructions that take part in the reduction. This set could then be used for checking in-loop reductions. Hopefully this could help simplify both, and keep them in some sync. But could be done later, possibly with another TODO..


Is treating sub as an add reduction something in-loop reduction could support as a future extension?


Ahh, ok. (It should have ExpectedUses+1 users being in lcssa.)



Is Subs the only issue?
Can check this earlier, before traversing the chain, although it is pushed back last, here.


Thanks. Worth updating the comment.


Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in the tests below), so could presumably lead to smaller code size? At least for in-loop chains of a single link.

And predicated reductions?

These are yet to be handled in-loop, right?


Then better placed above right after defining Phi?


Agreed it would be better to use deterministic ordering. How about letting InloopReductionChains be a MapVector and iterate over
for (auto &Reduction : CM.getInloopReductions())?
The number of reductions is expected to be small, w/o removals.


This is the other potential use of for (auto &Reduction : CM.getInloopReductions()).

dmgreen updated this revision to Diff 275979.Jul 7 2020, 3:41 AM
dmgreen marked 12 inline comments as done.

Thanks for taking another look.


Hmm. I don't want to say never. A normal inloop reduction looks like:

p = PHI(0, a)
l = VLDR (..)
a = VADDVA(p, l)

Where the VADDV is an across-vector reduction, and the extra A means also add p. Reducing a sub would need to become:

p = PHI(0, a)
l = VLDR (..)
a = VADDV(l)
p = SUB(p, a)

With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than an out-of-loop reduction.

But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that would previously not be vectorized.. that may lead to a gain overall to overcome the extra cost of adding the sub to the loop. It will require some very careful costing I think. And maybe the ability to create multiple vplans and cost them against one another :)


I believe sub is the only issue, unless I am forgetting something.


And predicated reductions?

These are yet to be handled in-loop, right?

Yep. It will need a predicated reduction intrinsic: a vecreduce that takes a mask. That will allow us to tail-fold the reductions whose trip counts are not a multiple of the vector factor, which will make them look a lot better under -Os. And nice in general, I think, once it all starts being tail predicated.

The backend work I was mentioning was that we need to more efficiently transform

x = min(vecreduce.min(z), y)

into

x = VMINV(y, z)

Where y is (confusingly) accumulated in this case (even though the instruction doesn't have an A suffix). We currently generate

x = min(VMINV(UINT_MAX, z), y)

Once that is sorted out then, yep, using these for Os sounds like a good plan.
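As a scalar sketch of that fold (hypothetical function names, not the actual backend code), both forms compute the same result; the folded one seeds the across-vector minimum with y directly instead of with the identity value:

```cpp
#include <algorithm>
#include <array>
#include <climits>
#include <cassert>

constexpr int VF = 4; // assumed vector width

// Current codegen: reduce the vector starting from the identity (INT_MAX for
// a signed min), then a separate scalar min with y.
int minUnfolded(const std::array<int, VF> &z, int y) {
  int r = INT_MAX;
  for (int v : z)
    r = std::min(r, v);
  return std::min(r, y);
}

// Desired fold: seed the across-vector minimum with y, matching the
// VMINV(y, z) form where y is folded into the reduction.
int minFolded(const std::array<int, VF> &z, int y) {
  int r = y;
  for (int v : z)
    r = std::min(r, v);
  return r;
}
```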


I can put it somewhere sensible in this patch and move it in the next :)


MapVector sounds good. I've changed it to use that and tried to use that in a few more places. Let me know what you think.

Ayal added inline comments.Jul 7 2020, 12:33 PM

Original sub code, say acc -= a[i], can be treated as acc += (-a[i]). This could be in-loop reduced by first negating the a[i]'s, at LV's LLVM-IR level, presumably lowered later to something like

p = PHI(0, a)
l = VLDR (..)
s = VSUBV (zero, l)
a = VADDVA(p, s)

, right?
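A scalar sketch of that equivalence (hypothetical names, not code from the patch): the sub reduction and the negate-then-add form compute the same value, so the latter can use a plain add reduction in the loop.

```cpp
#include <cassert>

// The original sub reduction: acc -= a[i].
int subReduce(const int *a, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc -= a[i];
  return acc;
}

// The rewritten form: negate each input, then use a plain add reduction,
// which the in-loop reduction machinery already handles.
int addNegReduce(const int *a, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc += -a[i];
  return acc;
}
```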


rephrase sentence so it parses


Ahh, this should actually be capitalized IsInLoopReductionPhi


Re: predicated reductions - could they be handled by replacing masked-off elements with Identity using a select prior to reduction? To be potentially folded later by suitable targets into a predicated reduction operation which they may support.
Somewhat akin to "passthru" values of masked loads.
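A scalar model of that suggestion (hypothetical names; a real implementation would emit a vector select before the vecreduce intrinsic): masked-off lanes contribute the reduction's identity value, which for an add is 0, so the masked reduction equals the reduction of the selected vector.

```cpp
#include <array>
#include <cassert>

constexpr int VF = 4; // assumed vector width

// Per-lane select of the identity for masked-off elements, then a plain
// horizontal add. A suitable target could fold this pattern into a single
// predicated reduction instruction.
int maskedVecReduceAdd(const std::array<int, VF> &v,
                       const std::array<bool, VF> &mask) {
  int sum = 0;
  for (int l = 0; l < VF; ++l) {
    int lane = mask[l] ? v[l] : 0; // select: 0 is the identity for add
    sum += lane;
  }
  return sum;
}
```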


Uses as MapVector look good to me, thanks. Can also retain isInLoopReduction(PHI).


recode >> record

dmgreen updated this revision to Diff 276198.Jul 7 2020, 2:07 PM
dmgreen marked 4 inline comments as done.

Reinstate isInLoopReduction and fixup some wording/typos/capitalisation.


Yep. We would have the option of trading a scalar instruction for a vector instruction + an extra register (to hold the 0; we only have 8 registers!)

Unfortunately both would be slower than an out-of-loop reduction unless we were vectorizing at a higher factor, though.


Oh. So select

s = select m, a, 0
v = vecreduce.add s

to a predicated vaddv?

Yeah, sounds interesting. I'll look into that as an alternative to predicated intrinsics. Nice suggestion.

Ayal added inline comments.Jul 7 2020, 3:05 PM

ok, so sub's can be handled in-loop, but doing so is expected to be more costly than out-of-loop, at least if a horizontal add operation is to be used rather than a horizontal subtract; probably worth a comment.
If a reduction chain has only sub's, they could all sink, negating the sum once after the loop and using VADDVA inside. Doing so, however, will retain the middle block, i.e., without decreasing code size.

dmgreen updated this revision to Diff 276364.Jul 8 2020, 3:52 AM

Update a FIXME with sub info.

gilr added inline comments.Jul 8 2020, 6:42 AM

It seems that TTI is only used later for deciding whether to use a shuffle sequence or an intrinsic based on data available during planning. If so, then it would be best if the Planner calls TTI->useReductionIntrinsic() and records that boolean decision in the Recipe. This is also required in order to estimate in-loop reduction cost. This could be done separately.

dmgreen marked an inline comment as done.Jul 9 2020, 10:15 AM
dmgreen added inline comments.

Do you mean to change the interface of createTargetReduction to take a bool instead? Yeah, I think that sounds good. I'd prefer to do it as a separate review as it does involve changing the interface. I will put a patch together.

I was imagining that we would change the cost to use getArithmeticReductionCost, which hopefully handles the details of how the target lowers reductions. I haven't looked deeply into the details yet though. That is on the list of things to do.

Ayal accepted this revision.Jul 12 2020, 7:35 AM

This looks good to me, thanks! with last couple of nits.


nit: "... that leads from the loop exit value back.." - chain is now found top-down.


nit: can record the recipe of Phi first, just to follow chain order.


nit: can assert CompareRecipe->getVPRecipeID()

This revision is now accepted and ready to land.Jul 12 2020, 7:35 AM

Thanks. I will update the patch, but I will wait until at least after the branch before I commit it.

D83646 is an attempt at changing the interface to createTargetReduction, so that we don't need to store TTI in the reduction recipe.

dmgreen updated this revision to Diff 278689.Jul 17 2020, 3:00 AM
dmgreen marked 2 inline comments as done.

Updates for those last nitpicks.

This revision was automatically updated to reflect the committed changes.

Thanks. Sorry about that; it appears clang doesn't like this code as much as gcc did, and then my internet went out whilst I was trying to figure out what was wrong.

I will try to run some extra testing using different compilers.