This is an archive of the discontinued LLVM Phabricator instance.


Differential D95245

[SVE] Add support for scalable vectorization of loops with int/fast FP reductions
ClosedPublic

Authored by kmclaughlin on Jan 22 2021, 9:28 AM.


Details

Reviewers

sdesmalen
david-arm
greened
fhahn
frasercrmck
efriedma
dmgreen

Commits

rGba1e150d03ca: [SVE] Add support for scalable vectorization of loops with int/fast FP…

Summary

This patch enables scalable vectorization of loops with integer/fast reductions, e.g:

unsigned sum = 0;
for (int i = 0; i < n; ++i) {
  sum += a[i];
}

A new TTI interface, isLegalToVectorizeReduction, has been added to prevent
reductions which are not supported for scalable types from vectorizing.
If the reduction is not supported for a given scalable VF,
computeFeasibleMaxVF will fall back to using fixed-width vectorization.
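
The fallback described in the summary can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: `RecurKind`, `VF` and `chooseVF` here are hypothetical stand-ins, not LLVM's types, and the toy target is assumed to reject scalable multiply reductions (as the `@mul` test mentioned later in the review suggests). Keeping the same minimum element count on fallback is also an assumption of the sketch.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only; these are not LLVM's types.
enum class RecurKind { Add, Mul, FAdd, FMul };

struct VF {
  unsigned MinElts; // known minimum number of elements
  bool Scalable;    // true for <vscale x MinElts x T>
};

// Toy target hook: assume there is no scalable reduction for multiplies.
bool isLegalToVectorizeReduction(RecurKind Kind, bool Scalable) {
  if (!Scalable)
    return true; // fixed-width reductions are always accepted in this sketch
  return Kind != RecurKind::Mul && Kind != RecurKind::FMul;
}

// Keep the requested VF only if every reduction in the loop is legal for it;
// otherwise fall back to a fixed-width VF (same minimum element count here,
// which is an assumption of this sketch).
VF chooseVF(const std::vector<RecurKind> &Reductions, VF UserVF) {
  for (RecurKind Kind : Reductions)
    if (!isLegalToVectorizeReduction(Kind, UserVF.Scalable))
      return VF{UserVF.MinElts, false};
  return UserVF;
}
```

With these definitions, a loop containing only an `Add` reduction keeps a scalable VF, while a loop that also contains a `Mul` reduction is given a fixed-width one.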


Event Timeline

kmclaughlin created this revision. Jan 22 2021, 9:28 AM

Herald added a reviewer: efriedma. Jan 22 2021, 9:28 AM

Herald added subscribers: NickHung, bmahjour, psnobl and 2 others.

kmclaughlin requested review of this revision. Jan 22 2021, 9:28 AM

Herald added a project: Restricted Project. Jan 22 2021, 9:28 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B86309: Diff 318549. Jan 22 2021, 10:08 AM

cameron.mcinally added a subscriber: cameron.mcinally. Jan 22 2021, 11:07 AM

bmahjour removed a subscriber: bmahjour. Jan 22 2021, 11:09 AM

Matt added a subscriber: Matt. Jan 22 2021, 11:13 AM

timsmith78 added a subscriber: timsmith78. Jan 22 2021, 12:18 PM

Hey Kerry,
Thank you for this patch.
I found some nits and have some suggestions about InstructionCost.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4695	Once we start to use scalable vectors and VF.getKnownMinValue(), shouldn't this be multiplied by getMaxVScale()?
4742	Same here: do we not need to multiply by getMaxVScale()?
6177	I believe we can use LoopCost.isValid() here!
6198	Can you change SmallLoopCost to be an InstructionCost, like LoopCost, so you don't need to use *LoopCost.getValue()? I believe you then won't need getValue in the std::min either.
7677	nit
9441	nit
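
The suggestion above (check validity directly and compare costs without unwrapping them) can be illustrated with a miniature cost type. This is a hedged sketch, not LLVM's `InstructionCost`: the class `Cost`, its members, and the helper `cheaper` are all hypothetical and exist only to show why a cost type with its own `isValid()` and `operator<` makes `std::min` usable without calling `getValue()`.

```cpp
#include <algorithm>
#include <optional>

// Hypothetical miniature of a cost type that can be invalid. It is not
// LLVM's InstructionCost, only an illustration of the idea.
class Cost {
  int Value = 0;
  bool Valid = true;

public:
  Cost(int V) : Value(V) {}

  static Cost getInvalid() {
    Cost C(0);
    C.Valid = false;
    return C;
  }

  bool isValid() const { return Valid; }

  // Empty when the cost is invalid, so callers cannot read a bogus number.
  std::optional<int> getValue() const {
    return Valid ? std::optional<int>(Value) : std::nullopt;
  }

  // Order invalid costs after every valid cost so std::min prefers valid ones.
  friend bool operator<(const Cost &A, const Cost &B) {
    if (A.Valid != B.Valid)
      return A.Valid;
    return A.Value < B.Value;
  }
};

// Compare costs directly instead of unwrapping both sides first.
inline Cost cheaper(const Cost &A, const Cost &B) { return std::min(A, B); }
```

In this sketch, keeping both operands in the same cost type is what removes the need for `*X.getValue()` at each comparison site.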

david-arm added inline comments. Jan 25 2021, 1:37 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1103	Just a thought - if we're excluding FMul from reductions is it worth having an assert here that the op is not fmul?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1525	It might be worth printing out the recurrence kind here. Do we also want to emit a remark here to help the user understand why it failed to vectorise?
4695	This is for vectorisation of induction variables. I think we'll have to use the runtime VF that I introduced in D95139 here. I don't think Kerry has to fix this in her patch.
6177	I think since we're changing LoopCost to be InstructionCost we can change the line above too from LoopCost = *expectedCost(VF).first.getValue(); to LoopCost = expectedCost(VF).first;
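
The distinction between the known minimum element count and the runtime count discussed for line 4695 can be made concrete with a small model. This is a sketch under assumptions: `ElemCount`, `runtimeElements` and `maxElements` are hypothetical names, not the LLVM `ElementCount` API or the runtime-VF helper from D95139.

```cpp
#include <cstdint>

// Hypothetical model of an element count; LLVM has its own ElementCount class.
struct ElemCount {
  uint64_t KnownMin; // elements per vscale unit
  bool Scalable;     // true for <vscale x KnownMin x T>
};

// Number of elements processed per vector iteration once the hardware's
// vscale is known. For fixed-width counts vscale is irrelevant.
uint64_t runtimeElements(ElemCount EC, uint64_t VScale) {
  return EC.Scalable ? EC.KnownMin * VScale : EC.KnownMin;
}

// Conservative upper bound when only the maximum vscale is known at
// compile time.
uint64_t maxElements(ElemCount EC, uint64_t MaxVScale) {
  return runtimeElements(EC, MaxVScale);
}
```

For example, a scalable count with a known minimum of 4 covers 8 elements when vscale is 2, whereas a fixed-width count of 4 always covers 4.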

david-arm added inline comments. Jan 25 2021, 1:37 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9457	Similar to an earlier comment, a remark here would be good I think.
llvm/test/Transforms/LoopVectorize/scalable_reductions.ll
1	Needs a "REQUIRES: asserts" here I think because you're relying upon debug output. Also, since you're explicitly adding "-mattr=+sve" here I think you'll have to either make the test generic so that it works for all targets (this test will fail on some builds due to lack of AArch64 support), or move the test to LoopVectorize/AArch64.
16	I wonder if it's worth adding CHECK lines for the resulting IR to show we've vectorised the loop using reductions and checking we have the right structure, i.e. vector.body, middle.block, etc?

fhahn added inline comments. Jan 25 2021, 2:02 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1309	This should probably have a comment.
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3924	Can you add a test for this? Also, this seems completely unrelated, can you split it off?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1277	those changes could also be submitted separately?
1519	This also needs a comment. And the name could probably be improved. Maybe `canVectorizeReductions`?
7665	This should only be checked in the code handling `UserVF` below? Also, this seems like a property that generally limits the vectorization factor to fixed-width vectorization factors and would be good to check beforehand. Would it be possible to just limit vectorization factors to fixed-width factors in `computeFeasibleMaxVF`? This way, we won't need extra checks once automatically picked VFs are supported. You also won't need any extra code in the caller of `::plan`. This is similar to how we deal with other 'legality' properties that depend on the vectorization factor, like dependencies that may limit the vectorization factor.
9455	This message seems a bit odd. I think the cost model should just be responsible for assigning a cost, not deciding whether it is possible to vectorize or not; that's the job of the legality checks. Please see my comment above; this could probably be done in `computeFeasibleMaxVF`, which technically is part of the cost model, but is the first step and applies other legality constraints as well, which limit the vectorization factor.
llvm/test/Transforms/LoopVectorize/scalable_reductions.ll
9	Personally I don't think the C source code adds much value. The IR is very compact and it should be obvious from the IR & test name what is going on. Also, the IR that clang generates can change, clang options may change, pragmas may change and so on.
20	this should not be needed for the test.
23	this should not be needed for the test, you can just pass `%n` as `i64`.
27	nit: can strip `indvars` from the name to make things more compact.

Removed changes to LoopVectorizationPlanner::plan and instead check whether reductions can be vectorized in computeFeasibleMaxVF. If any reduction in the loop cannot be vectorized with a scalable VF, we fall back on fixed-width vectorization.

Changes to have VectorizationFactor use InstructionCost were not necessary to the patch after the above change and have also been removed.

Improved the tests in scalable_reductions.ll based on suggestions from @fhahn & @david-arm

Thanks for reviewing this patch, all!

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
3924	I've removed this from the patch, I don't think it's required for the tests here.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6198	Hi @CarolineConcatto, thanks for your suggestions on InstructionCost! I didn't change the SmallLoopCost flag to be an instruction cost in the last revision as this caused tests which use -small-loop-cost to fail (e.g. LoopVectorize/unroll_novec.ll)
7665	Thanks for this suggestion, @fhahn. I've moved the canVectorizeReductions check to `computeFeasibleMaxVF` & updated the affected test in scalable_reductions.ll, where we can use fixed-width vectorization instead (`@mul`)
llvm/test/Transforms/LoopVectorize/scalable_reductions.ll
1	Added `REQUIRES: asserts` & moved the test to `Transforms/LoopVectorize/AArch64`

david-arm added inline comments. Jan 27 2021, 1:36 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1519	nit: Perhaps use `///` here instead of '//' in line with other function comments?
5691	I wonder if it's worth bailing out even earlier, i.e. in the same place as above where you check initially? I think the main benefit to bailing out here is if you can reduce the VF to something smaller so that it becomes legal. However, I think for reductions changing the VF won't make a difference in practice.
5700	nit: Perhaps use "operations" here instead of types? I'm thinking that the user probably isn't aware of the RecurrenceKind so type might not make as much sense?
llvm/test/Transforms/LoopVectorize/AArch64/scalable_reductions.ll
2 ↗	(On Diff #319303)	I think you can reduce the number of RUN lines here by piping stderr for the first RUN line to a temporary file, e.g. something like:
  ; RUN: opt < %s -loop-vectorize -transform-warning -mtriple aarch64-unknown-linux-gnu -mattr=+sve -debug-only=loop-vectorize -S 2>%t | FileCheck %s -check-prefix=CHECK
  ; RUN: cat %t | FileCheck %s -check-prefix=CHECK-DEBUG
4 ↗	(On Diff #319303)	Is it worth changing this to check for the new remark instead? You can use something like this: ; RUN: opt < %s -loop-vectorize -pass-remarks='loop-vectorize' -disable-output -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S 2>&1 | ...
223 ↗	(On Diff #319303)	I'm a bit surprised this vectorises to be honest, since there is no 'fast' flag here! Perhaps for IEEE math you have to add specific attributes to the function?

Moved the canVectorizeReductions check to earlier in computeFeasibleMaxVF
Updated the RUN lines in scalable_reductions.ll
Removed duplicate test for FAdd

kmclaughlin marked an inline comment as not done. Feb 1 2021, 10:11 AM

kmclaughlin added inline comments.

llvm/test/Transforms/LoopVectorize/AArch64/scalable_reductions.ll
223 ↗	(On Diff #319303)	I think what happened here is that the hints used to enable vectorization have allowed reordering, similar to using -Ofast. I found this comment at the top of allowReordering() in LoopVectorizationLegality:
  // When enabling loop hints are provided we allow the vectorizer to change
  // the order of operations that is given by the scalar loop. This is not
  // enabled by default because can be unsafe or inefficient. For example,
  // reordering floating-point operations will change the way round-off
  // error accumulates in the loop.
This behaviour was queried on the mailing list last year: https://lists.llvm.org/pipermail/llvm-dev/2020-June/142697.html
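
To make the round-off point concrete, here is a small standalone sketch that compares a strict in-order float sum with a sum accumulated in interleaved partial sums, which is roughly the shape a vectorized reduction takes. The function names are hypothetical and the code does not model the vectorizer itself; it assumes ordinary IEEE single-precision arithmetic without fast-math style compiler flags.

```cpp
#include <cstddef>
#include <vector>

// Strict left-to-right accumulation, as the scalar loop would do it.
float sumInOrder(const std::vector<float> &A) {
  float Sum = 0.0f;
  for (float X : A)
    Sum += X;
  return Sum;
}

// Accumulate into `Lanes` interleaved partial sums and combine them at the
// end. `Lanes` must be non-zero.
float sumInLanes(const std::vector<float> &A, std::size_t Lanes) {
  std::vector<float> Partial(Lanes, 0.0f);
  for (std::size_t I = 0; I < A.size(); ++I)
    Partial[I % Lanes] += A[I];
  float Total = 0.0f;
  for (float P : Partial)
    Total += P;
  return Total;
}
```

For an input such as `{1e8f, 1.0f, -1e8f, 1.0f}`, the in-order sum loses the first `1.0f` when it is added to `1e8f`, while the two-lane version adds the two small values together in their own lane, so the two functions can return different results for the same data.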

dcaballe added a subscriber: dcaballe. Feb 1 2021, 10:17 AM

dmgreen added a subscriber: dmgreen. Feb 1 2021, 10:26 AM

dmgreen added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1310	Does this need to check the type? Does an i128 reduction work, for example? I presume if a <vscale x 4 x float> reduction works then any <vscale x ? x float> will work?

Thanks for making the changes @kmclaughlin! Just a couple more comments ...

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5663	I think this looks much better now you're just checking reductions only once and early on - thanks for this! However, I think you might need to move this check down to line 5677 where we return UserVF. So the reason I think this is because if we have a loop that contains memory dependences and reductions in the same loop we want to ensure we always do the reduction checks regardless. For example, Legal->isSafeForAnyVectorWidth() could return false and then in the code below we may successfully reduce the UserVF from <vscale x 8 x float> to <vscale x 4 x float> without ever calling canVectorizeReductions.
llvm/test/Transforms/LoopVectorize/AArch64/scalable_reductions.ll
1 ↗	(On Diff #320512)	Thanks for the RUN line changes here, it looks a lot neater now! If it's not too difficult I think it would be great if you could test the remark here too, since this is user-facing rather than debug. If you want you can even test the remark instead of the debug output; this would also mean you can remove the "REQUIRES: asserts" line above too.
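
The ordering concern raised for line 5663 can be sketched as follows. This is a heavily simplified, hypothetical model: `LoopFacts`, `VecFactor` and `computeFeasibleVF` are illustrative names, and the real `computeFeasibleMaxVF` does considerably more. The point shown is only that the reduction check runs before, and independently of, any clamping caused by memory dependences.

```cpp
// Hypothetical, heavily simplified model of the decision order.
struct LoopFacts {
  bool ReductionsLegalForScalable; // result of the reduction legality query
  unsigned MaxSafeElts;            // 0 means safe for any vector width
};

struct VecFactor {
  unsigned MinElts;
  bool Scalable;
};

VecFactor computeFeasibleVF(const LoopFacts &L, VecFactor UserVF) {
  // 1. Reductions first: shrinking the width cannot make them legal, so this
  //    check must not be skipped when the width is clamped below.
  if (UserVF.Scalable && !L.ReductionsLegalForScalable)
    UserVF.Scalable = false;

  // 2. Memory dependences may only reduce the width.
  if (L.MaxSafeElts != 0 && UserVF.MinElts > L.MaxSafeElts)
    UserVF.MinElts = L.MaxSafeElts;

  return UserVF;
}
```

With a requested `<vscale x 8 x float>`, a loop that has an unsupported reduction and a dependence limit of 4 elements ends up fixed-width with 4 elements in this model, rather than being silently narrowed to a scalable VF of 4.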

sdesmalen added inline comments. Feb 2 2021, 4:16 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1078	nit: bail out early to reduce indentation. if (!Scalable) return true;
1095	nit: can be removed if you add the early bail out.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1522	nit: use `return llvm::all_of(....)` with lambda, instead of loop?
1525	Is it worth just passing the whole Recurrence descriptor and the whole of VF? When passing the whole Recurrence descriptor, in the future the function can also determine whether it can vectorize an ordered reduction (e.g. ordered fadd) in the loop body using some instruction.
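
The loop-versus-`all_of` nit can be illustrated with the standard library. The sketch below uses `std::all_of` so that it needs no LLVM headers; `llvm::all_of` is the range-based equivalent that takes the container directly. `RecurKind`, `isLegalKind` and both function names are hypothetical placeholders, not the patch's code.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for illustration only.
enum class RecurKind { Add, Mul, FAdd };

// Placeholder legality query: pretend multiplies are unsupported.
bool isLegalKind(RecurKind Kind) { return Kind != RecurKind::Mul; }

// Explicit loop form.
bool canVectorizeReductionsLoop(const std::vector<RecurKind> &Kinds) {
  for (RecurKind K : Kinds)
    if (!isLegalKind(K))
      return false;
  return true;
}

// Equivalent all_of form with a lambda.
bool canVectorizeReductions(const std::vector<RecurKind> &Kinds) {
  return std::all_of(Kinds.begin(), Kinds.end(),
                     [](RecurKind K) { return isLegalKind(K); });
}
```

Both forms return true for an empty list of reductions, which matches the intuition that a loop without reductions has nothing to reject.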

fhahn added inline comments. Feb 2 2021, 4:39 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5663	please also add a test for this scenario.

Moved the Legal->isSafeForAnyVectorWidth() check in computeFeasibleMaxVF further down so that we always check the reductions even if the loop contains memory dependencies. Added a test for this scenario to scalable_reductions.ll.

Changed isLegalToVectorizeReduction so that the whole RecurrenceDescriptor and VF are passed in, and added a check of the recurrence type.

Replaced the loop in canVectorizeReductions with an all_of call that takes a lambda

Removed REQUIRES: asserts from the test file and added -pass-remarks-analysis/missed flags to the RUN line

llvm/include/llvm/Analysis/TargetTransformInfo.h
1310	Hi @dmgreen, thanks for taking a look at this! I've added a check of the recurrence type to isLegalToVectorizeReduction. I think any <vscale x ? x float> reduction will work, I added some tests for legalization of vector reductions as part of D93050.

LGTM! Thanks for making all changes. Perhaps wait a while before merging in case others want a look?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1522	nit: I think you can remove the '(' and ')' surrounding the llvm::all_of call here.

LGTM. Forgot to click "Accept Revision" before. Doh!

This revision is now accepted and ready to land. Feb 3 2021, 8:40 AM

dmgreen added inline comments. Feb 4 2021, 12:42 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1082	Thanks. This looks like it should work for most current types. Are bfloats always supported? It may be better to be more specific in case other smaller-than-64bit float types are added in the future.

david-arm added inline comments. Feb 4 2021, 1:00 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1082	Is this needed though? If bfloats are in the scalar IR it means that the user has explicitly written code using the SVE ACLE so I'd imagine that all bets are off anyway if they didn't build with bf16 support. I'd also imagine that these would be flagged up as illegal types earlier on in the vectoriser too I think?

dmgreen added inline comments. Feb 4 2021, 3:46 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1082	Hmm. I guess I don't see the advantage of getting it wrong. Clang isn't the only frontend and the vectorizer needs to take any valid input and not crash or produce code that will later crash. Being specific about which types are supported seems like a better idea to me than hoping it works and hoping that won't change in the future.

david-arm added inline comments. Feb 4 2021, 3:58 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1082	No that's a fair point and happy for @kmclaughlin to add the check. However, we can't test such a scenario even with hand written IR because the vectoriser crashes without bfloat support:
LLVM ERROR: Cannot legalize this vector
#8 0x0000ffff959efad8 llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMContext&, llvm::EVT) const (.localalias) (/home/davshe01/upstream/llvm-project/build2/bin/../lib/libLLVMSupport.so.13git+0xcfad8)
#9 0x0000ffff959efbd8 llvm::TargetLoweringBase::getTypeLegalizationCost(llvm::DataLayout const&, llvm::Type*) const (/home/davshe01/upstream/llvm-project/build2/bin/../lib/libLLVMSupport.so.13git+0xcfbd8)

Added a function called isLegalScalarTypeForSVE which checks that the reduction type is supported & added a new test which uses bfloat to scalable-reductions.ll

Nice one. Thanks for the change. LGTM

sdesmalen added inline comments. Feb 4 2021, 2:09 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1103	The same should hold for integer Mul. nit: it would be better to add that to the switch statement below as:
  case Instruction::Mul:
  case Instruction::FMul:
    assert(!isa<ScalableVectorType>(Ty) && "Unexpected ...");
    LLVM_FALLTHROUGH;
  case Instruction::FAdd:
    ...
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
189	Can you merge this function with `isLegalScalarTypeForSVEMaskedMemOp` and name it `isLegalElementTypeForSVE`? I think their implementation should be the same (including your check here for `hasBF16`)
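
The assert-then-fall-through pattern suggested for line 1103 looks roughly like this in standalone C++17. Everything here is hypothetical: the `Opcode` enum, the function name, and in particular the return values are placeholders and do not reflect what the AArch64 `useReductionIntrinsic` actually returns. `[[fallthrough]]` is the standard attribute that plays the role of `LLVM_FALLTHROUGH`.

```cpp
#include <cassert>

// Hypothetical opcode set for illustration only.
enum class Opcode { Add, Mul, FAdd, FMul, Other };

// Sketch of a switch in which the mul cases assert that they are never
// reached for scalable types, then share the handling of the cases below.
// The return values are placeholders.
bool useReductionIntrinsicSketch(Opcode Op, bool IsScalableTy) {
  switch (Op) {
  case Opcode::Mul:
  case Opcode::FMul:
    assert(!IsScalableTy && "Unexpected scalable mul reduction");
    [[fallthrough]];
  case Opcode::Add:
  case Opcode::FAdd:
    return true;
  default:
    return false;
  }
}
```

The assertion documents an invariant established earlier (legality was already rejected for scalable multiplies), so reaching it with a scalable type would indicate a bug elsewhere rather than a condition to handle here.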

sdesmalen added inline comments. Feb 5 2021, 1:54 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
190–191	I forgot to mention that there are no reduction instructions for bfloat, so you'll need to catch out that specific case in `isLegalToVectorizeReduction`

Merged isLegalScalarTypeForSVEMaskedMemOp & isLegalScalarTypeForSVE
Return false from isLegalToVectorizeReduction for bfloat types
Included isa<ScalableVectorType>(Ty) in the switch statement conditions of useReductionIntrinsic
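
How the merged element-type check and the bfloat special case fit together can be sketched like this. It is an illustration under assumptions, not the patch's code: `ElemTy` and `Subtarget` are hypothetical, the function signatures are simplified, and treating `I128` as illegal is an assumption made for the example. The early return for fixed-width types mirrors the "bail out early" nit above.

```cpp
// Hypothetical element types and subtarget description.
enum class ElemTy { I8, I16, I32, I64, I128, F16, BF16, F32, F64 };

struct Subtarget {
  bool HasBF16; // whether the bf16 extension is available
};

// Shared check for element types usable in SVE vectors. bfloat is only
// legal when the subtarget has bf16 support.
bool isLegalElementTypeForSVE(ElemTy Ty, const Subtarget &ST) {
  switch (Ty) {
  case ElemTy::BF16:
    return ST.HasBF16;
  case ElemTy::I128:
    return false; // assumption of this sketch
  default:
    return true;
  }
}

// Reduction legality: bail out early for fixed-width, reject bfloat because
// there are no reduction instructions for it, otherwise defer to the shared
// element-type check.
bool isLegalToVectorizeReduction(ElemTy Ty, bool Scalable,
                                 const Subtarget &ST) {
  if (!Scalable)
    return true;
  if (Ty == ElemTy::BF16)
    return false;
  return isLegalElementTypeForSVE(Ty, ST);
}
```

Note that bfloat is rejected for reductions even when `HasBF16` is set, while the shared element-type check still accepts it for other uses.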

Thanks for the changes. I only have some more comments about the tests now.

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll
101 ↗	(On Diff #321749)	nit: remove `dso_local` here and in other definitions.
340 ↗	(On Diff #321749)	This CHECK-DEBUG (with its own RUN line) is not checking which function is not vectorizing; it could just as well be emitted for one of the other functions. I'd suggest explicitly adding checks for `@mul` and adding a CHECK-DEBUG line for the other tests as well.
376 ↗	(On Diff #321749)	Same as above. Can you also add a comment saying why you're testing a `memory_dependence` issue in a test file called `scalable-reductions.ll` ?
424 ↗	(On Diff #321749)	These two fmin/fmax tests are not very useful, because the loop doesn't fail to vectorize because of code added in this patch.
470 ↗	(On Diff #321749)	nit: use `nnan` directly in the fp operation instead of an attribute.

david-arm added inline comments. Feb 8 2021, 12:38 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1103	Hi @sdesmalen, just for information the reason I'd asked for an assert here is that if we're still intending to create a target reduction intrinsic at this point with a mul or fmul then something has gone badly wrong and is almost certainly a bug. This is because this function is only ever called at the point where you've already decided that it's legal to reduce a scalable mul operation. The two places where this is called are from SLPVectorizer.cpp:createSimpleTargetReduction and InnerLoopVectorizer::fixReduction (via createTargetReduction).

david-arm added inline comments. Feb 8 2021, 1:14 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1103	Sorry, please ignore my comment! For some reason I hadn't seen the assert in there.

kmclaughlin mentioned this in D96350: [SVE][LoopVectorize] Enable vectorization of fmin/fmax with nnan. Feb 9 2021, 9:09 AM

Changes to the tests in scalable-reductions.ll:

Removed dso_local from definitions
Added a comment on the purpose of the memory_dependence test
Added CHECK-REMARK lines for each test in the file
Removed the unnecessary fmin/fmax tests where we can't vectorize

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll
470 ↗	(On Diff #321749)	Hi @sdesmalen, these tests for fmin/fmax fail without the `no-nans-fp-math` attribute, I think because `RecurrenceDescriptor::isRecurrenceInstr` is just checking for the function attribute and not the flags on the instruction. I've created a separate patch (D96350) to try and address this.

Rebased changes

LGTM! Latest version looks good and I think you've addressed @sdesmalen's comments. Thanks!

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll
376 ↗	(On Diff #323303)	nit: Perhaps you could make it clear you're testing the ordering, i.e. with something like: This test was added to ensure we always check the legality of reductions (and emit a warning if necessary) before checking for memory dependencies

LV changes LGTM, thanks for the updates!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1520	nit: `.` at end of sentence.
1522	nit: `llvm::` should not be required
5669	I think you should be able to use `reportVectorizationFailure` to print to `dbgs()` and generate a remark with the same message
llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll
20 ↗	(On Diff #323303)	nit: those checks should not be needed.

Closed by commit rGba1e150d03ca: [SVE] Add support for scalable vectorization of loops with int/fast FP… (authored by kmclaughlin). Feb 16 2021, 5:50 AM

This revision was automatically updated to reflect the committed changes.

kmclaughlin marked 5 inline comments as done.

kmclaughlin added a commit: rGba1e150d03ca: [SVE] Add support for scalable vectorization of loops with int/fast FP….

Thanks all for reviewing these changes!

sdesmalen mentioned this in D96021: [LoopVectorize] NFC: Move UserVF feasibility checks to separate function.. Feb 16 2021, 6:39 AM

Revision Contents

Path                                                         Size
llvm/include/llvm/Analysis/TargetTransformInfo.h             9 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h         3 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                    5 lines
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp               2 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h         2 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp       28 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h     3 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp              38 lines
llvm/test/Transforms/LoopVectorize/scalable_reductions.ll    465 lines

Diff 318549

llvm/include/llvm/Analysis/TargetTransformInfo.h

[15 lines not shown]
 /// This file defines #2, which is the interface that IR-level transformations
 /// use for querying the codegen.
 ///
 //===----------------------------------------------------------------------===//

 #ifndef LLVM_ANALYSIS_TARGETTRANSFORMINFO_H
 #define LLVM_ANALYSIS_TARGETTRANSFORMINFO_H

+#include "llvm/Analysis/IVDescriptors.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/Operator.h"
 #include "llvm/IR/PassManager.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/AtomicOrdering.h"
 #include "llvm/Support/DataTypes.h"
 #include "llvm/Support/InstructionCost.h"
 #include <functional>
[1,268 lines not shown]
   /// \returns True if it is legal to vectorize the given load chain.
   bool isLegalToVectorizeLoadChain(unsigned ChainSizeInBytes, Align Alignment,
                                    unsigned AddrSpace) const;

   /// \returns True if it is legal to vectorize the given store chain.
   bool isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes, Align Alignment,
                                     unsigned AddrSpace) const;

+  bool isLegalToVectorizeReduction(RecurKind RecKind, bool Scalable) const;

   /// \returns The new vector factor value if the target doesn't support \p
   /// SizeInBytes loads or has a better vector factor.
   unsigned getLoadVectorFactor(unsigned VF, unsigned LoadSize,
                                unsigned ChainSizeInBytes,
                                VectorType *VecTy) const;

   /// \returns The new vector factor value if the target doesn't support \p
   /// SizeInBytes stores or has a better vector factor.
[323 lines not shown] public:
   virtual bool isLegalToVectorizeLoad(LoadInst *LI) const = 0;
   virtual bool isLegalToVectorizeStore(StoreInst *SI) const = 0;
   virtual bool isLegalToVectorizeLoadChain(unsigned ChainSizeInBytes,
                                            Align Alignment,
                                            unsigned AddrSpace) const = 0;
   virtual bool isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes,
                                             Align Alignment,
                                             unsigned AddrSpace) const = 0;
+  virtual bool isLegalToVectorizeReduction(RecurKind RecKind,
+                                           bool Scalable) const = 0;
   virtual unsigned getLoadVectorFactor(unsigned VF, unsigned LoadSize,
                                        unsigned ChainSizeInBytes,
                                        VectorType *VecTy) const = 0;
   virtual unsigned getStoreVectorFactor(unsigned VF, unsigned StoreSize,
                                         unsigned ChainSizeInBytes,
                                         VectorType *VecTy) const = 0;
   virtual bool useReductionIntrinsic(unsigned Opcode, Type *Ty,
                                      ReductionFlags) const = 0;
[510 lines not shown] bool isLegalToVectorizeLoadChain(unsigned ChainSizeInBytes, Align Alignment,
     return Impl.isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment,
                                             AddrSpace);
   }
   bool isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes, Align Alignment,
                                     unsigned AddrSpace) const override {
     return Impl.isLegalToVectorizeStoreChain(ChainSizeInBytes, Alignment,
                                              AddrSpace);
   }
+  bool isLegalToVectorizeReduction(RecurKind RecKind,
+                                   bool Scalable) const override {
+    return Impl.isLegalToVectorizeReduction(RecKind, Scalable);
+  }
   unsigned getLoadVectorFactor(unsigned VF, unsigned LoadSize,
                                unsigned ChainSizeInBytes,
                                VectorType *VecTy) const override {
     return Impl.getLoadVectorFactor(VF, LoadSize, ChainSizeInBytes, VecTy);
   }
   unsigned getStoreVectorFactor(unsigned VF, unsigned StoreSize,
                                 unsigned ChainSizeInBytes,
                                 VectorType *VecTy) const override {
[135 lines not shown]

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

[680 lines not shown] bool isLegalToVectorizeLoadChain(unsigned ChainSizeInBytes, Align Alignment,
     return true;
   }

   bool isLegalToVectorizeStoreChain(unsigned ChainSizeInBytes, Align Alignment,
                                     unsigned AddrSpace) const {
     return true;
   }

+  bool isLegalToVectorizeReduction(RecurKind RecKind,
+                                   bool Scalable) const { return true; }

Lint: Pre-merge checks: clang-format: please reformat the code
-  bool isLegalToVectorizeReduction(RecurKind RecKind,
-                                   bool Scalable) const { return true; }
+  bool isLegalToVectorizeReduction(RecurKind RecKind, bool Scalable) const {
+    return true;
+  }

   unsigned getLoadVectorFactor(unsigned VF, unsigned LoadSize,
                                unsigned ChainSizeInBytes,
                                VectorType *VecTy) const {
     return VF;
   }

   unsigned getStoreVectorFactor(unsigned VF, unsigned StoreSize,
                                 unsigned ChainSizeInBytes,
[418 lines not shown]

llvm/lib/Analysis/TargetTransformInfo.cpp

[1,029 lines not shown]
 }

 bool TargetTransformInfo::isLegalToVectorizeStoreChain(
     unsigned ChainSizeInBytes, Align Alignment, unsigned AddrSpace) const {
   return TTIImpl->isLegalToVectorizeStoreChain(ChainSizeInBytes, Alignment,
                                                AddrSpace);
 }

+bool TargetTransformInfo::isLegalToVectorizeReduction(RecurKind RecKind,
+                                                      bool Scalable) const {
+  return TTIImpl->isLegalToVectorizeReduction(RecKind, Scalable);
+}

 unsigned TargetTransformInfo::getLoadVectorFactor(unsigned VF,
                                                   unsigned LoadSize,
                                                   unsigned ChainSizeInBytes,
                                                   VectorType *VecTy) const {
   return TTIImpl->getLoadVectorFactor(VF, LoadSize, ChainSizeInBytes, VecTy);
 }

 unsigned TargetTransformInfo::getStoreVectorFactor(unsigned VF,
[428 lines not shown]
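
The three files above each add one piece of the same forwarding chain: a pure virtual in the concept, an override in the model that forwards to `Impl`, a default in the base implementation, and a public wrapper that calls through `TTIImpl`. A compact, self-contained sketch of that pattern is shown below. All names (`ToyTTI`, `BaseImpl`, `ToyTargetImpl`) are hypothetical and the classes are drastically reduced; the sketch only illustrates how a default answer of `true` can be replaced by a target-specific one without changing callers.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for illustration only.
enum class RecurKind { Add, Mul };

// Default implementation: everything is legal unless a target says otherwise.
struct BaseImpl {
  bool isLegalToVectorizeReduction(RecurKind, bool) const { return true; }
};

// A hypothetical target that rejects scalable multiply reductions.
struct ToyTargetImpl : BaseImpl {
  bool isLegalToVectorizeReduction(RecurKind K, bool Scalable) const {
    return !Scalable || K != RecurKind::Mul;
  }
};

// Public wrapper that hides the concrete implementation behind a concept.
class ToyTTI {
  struct Concept {
    virtual ~Concept() = default;
    virtual bool isLegalToVectorizeReduction(RecurKind, bool) const = 0;
  };

  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    bool isLegalToVectorizeReduction(RecurKind K, bool S) const override {
      return Impl.isLegalToVectorizeReduction(K, S);
    }
  };

  std::unique_ptr<Concept> TTIImpl;

public:
  template <typename T>
  explicit ToyTTI(T Impl)
      : TTIImpl(std::make_unique<Model<T>>(std::move(Impl))) {}

  bool isLegalToVectorizeReduction(RecurKind K, bool S) const {
    return TTIImpl->isLegalToVectorizeReduction(K, S);
  }
};
```

Constructing `ToyTTI` from `BaseImpl` gives the permissive default, while constructing it from `ToyTargetImpl` makes the same query return false for a scalable multiply reduction.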

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 3,915 Lines • ▼ Show 20 Lines	case ISD::INSERT_VECTOR_ELT: {
}		}
assert(Tmp <= VTBits && "Failed to determine minimum sign bits");		assert(Tmp <= VTBits && "Failed to determine minimum sign bits");
return Tmp;		return Tmp;
}		}
case ISD::EXTRACT_VECTOR_ELT: {		case ISD::EXTRACT_VECTOR_ELT: {
SDValue InVec = Op.getOperand(0);		SDValue InVec = Op.getOperand(0);
SDValue EltNo = Op.getOperand(1);		SDValue EltNo = Op.getOperand(1);
EVT VecVT = InVec.getValueType();		EVT VecVT = InVec.getValueType();
		if (VecVT.isScalableVector())
		fhahn (Done): Can you add a test for this? Also, this seems completely unrelated, can you split it off?
		kmclaughlin (Author, Done): I've removed this from the patch, I don't think it's required for the tests here.
		break;
const unsigned BitWidth = Op.getValueSizeInBits();		const unsigned BitWidth = Op.getValueSizeInBits();
const unsigned EltBitWidth = Op.getOperand(0).getScalarValueSizeInBits();		const unsigned EltBitWidth = Op.getOperand(0).getScalarValueSizeInBits();
const unsigned NumSrcElts = VecVT.getVectorNumElements();		const unsigned NumSrcElts = VecVT.getVectorNumElements();

// If BitWidth > EltBitWidth the value is anyext:ed, and we do not know		// If BitWidth > EltBitWidth the value is anyext:ed, and we do not know
// anything about sign bits. But if the sizes match we can derive knowledge		// anything about sign bits. But if the sizes match we can derive knowledge
// about sign bits from the vector operand.		// about sign bits from the vector operand.
if (BitWidth != EltBitWidth)		if (BitWidth != EltBitWidth)
▲ Show 20 Lines • Show All 6,293 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 180 Lines • ▼ Show 20 Lines	public:
void getPeelingPreferences(Loop *L, ScalarEvolution &SE,		void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
TTI::PeelingPreferences &PP);		TTI::PeelingPreferences &PP);

Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,		Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
Type *ExpectedType);		Type *ExpectedType);

bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info);		bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info);

bool isLegalMaskedLoadStore(Type *DataType, Align Alignment) {		bool isLegalMaskedLoadStore(Type *DataType, Align Alignment) {
		sdesmalen (Done): Can you merge this function with `isLegalScalarTypeForSVEMaskedMemOp` and name it `isLegalElementTypeForSVE`? I think their implementation should be the same (including your check here for `hasBF16`).
if (!isa<ScalableVectorType>(DataType) || !ST->hasSVE())		if (!isa<ScalableVectorType>(DataType) || !ST->hasSVE())
return false;		return false;
		sdesmalen (Done): I forgot to mention that there are no reduction instructions for bfloat, so you'll need to catch out that specific case in `isLegalToVectorizeReduction`.

Type *Ty = cast<ScalableVectorType>(DataType)->getElementType();		Type *Ty = cast<ScalableVectorType>(DataType)->getElementType();
if (Ty->isPointerTy())		if (Ty->isPointerTy())
return true;		return true;

if (Ty->isBFloatTy() || Ty->isHalfTy() ||		if (Ty->isBFloatTy() || Ty->isHalfTy() ||
Ty->isFloatTy() \|\| Ty->isDoubleTy())		Ty->isFloatTy() \|\| Ty->isDoubleTy())
return true;		return true;
▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	public:
bool shouldExpandReduction(const IntrinsicInst *II) const { return false; }		bool shouldExpandReduction(const IntrinsicInst *II) const { return false; }

unsigned getGISelRematGlobalCost() const {		unsigned getGISelRematGlobalCost() const {
return 2;		return 2;
}		}

bool supportsScalableVectors() const { return ST->hasSVE(); }		bool supportsScalableVectors() const { return ST->hasSVE(); }

		bool isLegalToVectorizeReduction(RecurKind RecKind, bool Scalable) const;

bool useReductionIntrinsic(unsigned Opcode, Type *Ty,		bool useReductionIntrinsic(unsigned Opcode, Type *Ty,
TTI::ReductionFlags Flags) const;		TTI::ReductionFlags Flags) const;

int getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		int getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
bool IsPairwiseForm,		bool IsPairwiseForm,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);

int getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp, int Index,		int getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp, int Index,
VectorType *SubTp);		VectorType *SubTp);
/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_AARCH64_AARCH64TARGETTRANSFORMINFO_H		#endif // LLVM_LIB_TARGET_AARCH64_AARCH64TARGETTRANSFORMINFO_H

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 1,067 Lines • ▼ Show 20 Lines	if (const GetElementPtrInst *GEPInst = dyn_cast<GetElementPtrInst>(U)) {
AllowPromotionWithoutCommonHeader = true;		AllowPromotionWithoutCommonHeader = true;
break;		break;
}		}
}		}
}		}
return Considerable;		return Considerable;
}		}

		bool AArch64TTIImpl::isLegalToVectorizeReduction(RecurKind RecKind,
		bool Scalable) const {
		if (Scalable) {
		sdesmalen (Done): nit: bail out early to reduce indentation: `if (!Scalable) return true;`
		switch (RecKind) {
		case RecurKind::Add:
		Lint: Pre-merge checks: clang-format: please reformat the code. Flagged lines: `case RecurKind::Add:`, `case RecurKind::FAdd:`, `case RecurKind::And:`, `case RecurKind::Or:`, `case RecurKind::Xor:`, `case RecurKind::SMin:`, `case RecurKind::SMax:`, `case RecurKind::UMin:`, `case RecurKind::UMax:`, `case RecurKind::FMin:` (18 diff lines are omitted).
		case RecurKind::FAdd:
		case RecurKind::And:
		dmgreen (Not Done): Thanks. This looks like it should work for most current types. Are bfloats always supported? It may be better to be more specific in case other smaller-than-64bit float types are added in the future.
		david-arm (Not Done): Is this needed though? If bfloats are in the scalar IR it means that the user has explicitly written code using the SVE ACLE, so I'd imagine that all bets are off anyway if they didn't build with bf16 support. I'd also imagine that these would be flagged up as illegal types earlier on in the vectoriser too, I think?
		dmgreen (Not Done): Hmm. I guess I don't see the advantage of getting it wrong. Clang isn't the only frontend and the vectorizer needs to take any valid input and not crash or produce code that will later crash. Being specific about which types are supported seems like a better idea to me than hoping it works and hoping that won't change in the future.
		david-arm (Not Done): No, that's a fair point and I'm happy for @kmclaughlin to add the check. However, we can't test such a scenario even with hand-written IR because the vectoriser crashes without bfloat support: LLVM ERROR: Cannot legalize this vector #8 0x0000ffff959efad8 llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMContext&, llvm::EVT) const (.localalias) (/home/davshe01/upstream/llvm-project/build2/bin/../lib/libLLVMSupport.so.13git+0xcfad8) #9 0x0000ffff959efbd8 llvm::TargetLoweringBase::getTypeLegalizationCost(llvm::DataLayout const&, llvm::Type*) const (/home/davshe01/upstream/llvm-project/build2/bin/../lib/libLLVMSupport.so.13git+0xcfbd8)
		case RecurKind::Or:
		case RecurKind::Xor:
		case RecurKind::SMin:
		case RecurKind::SMax:
		case RecurKind::UMin:
		case RecurKind::UMax:
		case RecurKind::FMin:
		case RecurKind::FMax:
		return true;
		default:
		return false;
		}
		return false;
		sdesmalen (Done): nit: can be removed if you add the early bail out.
		}

		return true;
		}
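The early bail-out that sdesmalen suggests for `isLegalToVectorizeReduction` can be sketched in isolation. This is a minimal, standalone illustration and not the committed code: `RecurKind` below is a local stand-in enum rather than LLVM's, and `isLegalToVectorizeReductionSketch` is a hypothetical free function used in place of the `AArch64TTIImpl` member. The accepted kinds are copied from the switch shown in this revision of the diff.

```cpp
// Local stand-in for llvm::RecurKind (assumption: only the kinds needed here).
enum class RecurKind {
  Add, Mul, FAdd, FMul, And, Or, Xor, SMin, SMax, UMin, UMax, FMin, FMax
};

// Hypothetical free-function version of the hook with the early return.
constexpr bool isLegalToVectorizeReductionSketch(RecurKind Kind,
                                                 bool Scalable) {
  // Fixed-width vectorization is not restricted by this hook.
  if (!Scalable)
    return true;

  // Scalable vectorization: accept only the kinds listed in the diff.
  switch (Kind) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}
```

With the early return in place, the trailing `return false;` after the switch and the outer `if (Scalable)` block are no longer needed, which is the point of the two nits above.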

bool AArch64TTIImpl::useReductionIntrinsic(unsigned Opcode, Type *Ty,		bool AArch64TTIImpl::useReductionIntrinsic(unsigned Opcode, Type *Ty,
TTI::ReductionFlags Flags) const {		TTI::ReductionFlags Flags) const {
		if (isa<ScalableVectorType>(Ty))
		david-arm (Done): Just a thought: if we're excluding FMul from reductions, is it worth having an assert here that the op is not fmul?
		sdesmalen (Not Done): The same should hold for integer Mul. nit: you can better add that to the switch statement below as: `case Instruction::Mul: case Instruction::FMul: assert(!isa<ScalableVectorType>(Ty) && "Unexpected ..."); LLVM_FALLTHROUGH; case Instruction::FAdd: ...`
		david-arm (Not Done): Hi @sdesmalen, just for information, the reason I'd asked for an assert here is that if we're still intending to create a target reduction intrinsic at this point with a mul or fmul then something has gone badly wrong and is almost certainly a bug. This is because this function is only ever called at the point where you've already decided that it's legal to reduce a scalable mul operation. The two places where this is called are from SLPVectorizer.cpp:createSimpleTargetReduction and InnerLoopVectorizer::fixReduction (via createTargetReduction).
		david-arm (Not Done): Sorry, please ignore my comment! For some reason I hadn't seen the assert in there.
		return true;

auto *VTy = cast<VectorType>(Ty);		auto *VTy = cast<VectorType>(Ty);
unsigned ScalarBits = Ty->getScalarSizeInBits();		unsigned ScalarBits = Ty->getScalarSizeInBits();
switch (Opcode) {		switch (Opcode) {
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor:		case Instruction::Xor:
▲ Show 20 Lines • Show All 158 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

	Show All 22 Lines

	#ifndef LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H			#ifndef LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H
	#define LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H			#define LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H

	#include "VPlan.h"			#include "VPlan.h"
	#include "llvm/Analysis/LoopInfo.h"			#include "llvm/Analysis/LoopInfo.h"
	#include "llvm/Analysis/TargetLibraryInfo.h"			#include "llvm/Analysis/TargetLibraryInfo.h"
	#include "llvm/Analysis/TargetTransformInfo.h"			#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/Support/InstructionCost.h"

	namespace llvm {			namespace llvm {

	class LoopVectorizationLegality;			class LoopVectorizationLegality;
	class LoopVectorizationCostModel;			class LoopVectorizationCostModel;
	class PredicatedScalarEvolution;			class PredicatedScalarEvolution;
	class VPRecipeBuilder;			class VPRecipeBuilder;

	▲ Show 20 Lines • Show All 130 Lines • ▼ Show 20 Lines
	/// VectorizerParams::VectorizationFactor and VectorizationCostTy.			/// VectorizerParams::VectorizationFactor and VectorizationCostTy.
	/// We need to streamline them.			/// We need to streamline them.

	/// Information about vectorization costs			/// Information about vectorization costs
	struct VectorizationFactor {			struct VectorizationFactor {
	// Vector width with best cost			// Vector width with best cost
	ElementCount Width;			ElementCount Width;
	// Cost of the loop with that width			// Cost of the loop with that width
	unsigned Cost;			InstructionCost Cost;

	// Width 1 means no vectorization, cost 0 means uncomputed cost.			// Width 1 means no vectorization, cost 0 means uncomputed cost.
	static VectorizationFactor Disabled() {			static VectorizationFactor Disabled() {
	return {ElementCount::getFixed(1), 0};			return {ElementCount::getFixed(1), 0};
	}			}

	bool operator==(const VectorizationFactor &rhs) const {			bool operator==(const VectorizationFactor &rhs) const {
	return Width == rhs.Width && Cost == rhs.Cost;			return Width == rhs.Width && Cost == rhs.Cost;
	▲ Show 20 Lines • Show All 142 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 1,268 Lines • ▼ Show 20 Lines	public:
/// that needs to be vectorized. We ignore values that remain scalar such as		/// that needs to be vectorized. We ignore values that remain scalar such as
/// 64 bit loop indices.		/// 64 bit loop indices.
std::pair<unsigned, unsigned> getSmallestAndWidestTypes();		std::pair<unsigned, unsigned> getSmallestAndWidestTypes();

/// \return The desired interleave count.		/// \return The desired interleave count.
/// If interleave count has been specified by metadata it will be returned.		/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost		/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.		/// are the selected vectorization factor and the cost of the selected VF.
unsigned selectInterleaveCount(ElementCount VF, unsigned LoopCost);		unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
		fhahn (Done): Could those changes also be submitted separately?

/// Memory access instruction may be vectorized in more than one way.		/// Memory access instruction may be vectorized in more than one way.
/// Form of instruction after vectorization depends on cost.		/// Form of instruction after vectorization depends on cost.
/// This function takes cost-based decisions for Load/Store instructions		/// This function takes cost-based decisions for Load/Store instructions
/// and collects them in a map. This decisions map is used for building		/// and collects them in a map. This decisions map is used for building
/// the lists of loop-uniform and loop-scalar instructions.		/// the lists of loop-uniform and loop-scalar instructions.
/// The calculated cost is saved with widening decision in order to		/// The calculated cost is saved with widening decision in order to
/// avoid redundant calculations.		/// avoid redundant calculations.
▲ Show 20 Lines • Show All 225 Lines • ▼ Show 20 Lines	bool isLegalGatherOrScatter(Value *V) {
if (!LI && !SI)		if (!LI && !SI)
return false;		return false;
auto *Ty = getMemInstValueType(V);		auto *Ty = getMemInstValueType(V);
Align Align = getLoadStoreAlignment(V);		Align Align = getLoadStoreAlignment(V);
return (LI && isLegalMaskedGather(Ty, Align)) ||		return (LI && isLegalMaskedGather(Ty, Align)) ||
(SI && isLegalMaskedScatter(Ty, Align));		(SI && isLegalMaskedScatter(Ty, Align));
}		}

		bool isLegalWideningOperation(ElementCount VF) {
		fhahn (Done): This also needs a comment. And the name could probably be improved. Maybe `canVectorizeReductions`?
		david-arm (Done): nit: Perhaps use `///` here instead of `//`, in line with other function comments?
		for (auto &Reduction : Legal->getReductionVars()) {
		fhahn (Done): nit: `.` at end of sentence.
		RecurrenceDescriptor RdxDesc = Reduction.second;
		if (!TTI.isLegalToVectorizeReduction(RdxDesc.getRecurrenceKind(),
		sdesmalen (Done): nit: use `return llvm::all_of(....)` with lambda, instead of loop?
		david-arm (Not Done): nit: I think you can remove the '(' and ')' surrounding the llvm::all_of call here.
		fhahn (Done): nit: `llvm::` should not be required.
		VF.isScalable())) {
		Lint: Pre-merge checks: clang-format: please reformat the code (whitespace-only change to the `VF.isScalable())) {` line).
		LLVM_DEBUG(
		dbgs() << "LV: Not vectorizing. Found invalid reduction type.\n");
		david-arm (Not Done): It might be worth printing out the recurrence kind here. Do we also want to emit a remark here to help the user understand why it failed to vectorise?
		sdesmalen (Done): Is it worth just passing the whole Recurrence descriptor and the whole of VF? When passing the whole Recurrence descriptor, in the future the function can also determine whether it can vectorize an ordered reduction (e.g. ordered fadd) in the loop body using some instruction.
		return false;
		}
		}
		return true;
		}
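The loop above, rewritten along the lines of the `all_of` suggestion, might look like the following standalone sketch. It uses `std::all_of` over a plain `std::vector` rather than `llvm::all_of` over `Legal->getReductionVars()`, so that it compiles without LLVM. `RecurKind`, `isLegalToVectorizeReductionSketch` and `canVectorizeReductionsSketch` are hypothetical stand-ins (the last one borrows the name fhahn proposes), and the legality rule inside the stand-in hook is deliberately simplified.

```cpp
#include <algorithm>
#include <cassert> // only needed when asserting on the results
#include <vector>

// Local stand-in for llvm::RecurKind (assumption: a reduced set of kinds).
enum class RecurKind { Add, Mul, FAdd, FMul };

// Simplified stand-in for TTI.isLegalToVectorizeReduction: fixed-width is
// always legal, scalable is legal only for Add and FAdd in this sketch.
bool isLegalToVectorizeReductionSketch(RecurKind Kind, bool Scalable) {
  if (!Scalable)
    return true;
  return Kind == RecurKind::Add || Kind == RecurKind::FAdd;
}

// Returns true only if every reduction in the loop is legal for the
// requested kind of vectorization factor (scalable or fixed-width).
bool canVectorizeReductionsSketch(const std::vector<RecurKind> &Reductions,
                                  bool Scalable) {
  return std::all_of(Reductions.begin(), Reductions.end(),
                     [&](RecurKind Kind) {
                       return isLegalToVectorizeReductionSketch(Kind, Scalable);
                     });
}
```

A loop with no reductions is trivially accepted, since `std::all_of` returns true for an empty range; that matches the behaviour of the explicit loop, which falls through to `return true`.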

/// Returns true if \p I is an instruction that will be scalarized with		/// Returns true if \p I is an instruction that will be scalarized with
/// predication. Such instructions include conditional stores and		/// predication. Such instructions include conditional stores and
/// instructions that may divide by zero.		/// instructions that may divide by zero.
/// If a non-zero VF has been calculated, we check if I will be scalarized		/// If a non-zero VF has been calculated, we check if I will be scalarized
/// predication for that VF.		/// predication for that VF.
bool isScalarWithPredication(Instruction *I,		bool isScalarWithPredication(Instruction *I,
ElementCount VF = ElementCount::getFixed(1));		ElementCount VF = ElementCount::getFixed(1));

▲ Show 20 Lines • Show All 3,055 Lines • ▼ Show 20 Lines	if (VF.isVector() && IsPtrLoopInvariant && IsIndexLoopInvariant.all()) {
}		}
}		}
}		}

void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN,		void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN,
RecurrenceDescriptor *RdxDesc,		RecurrenceDescriptor *RdxDesc,
Value *StartV, unsigned UF,		Value *StartV, unsigned UF,
ElementCount VF) {		ElementCount VF) {
assert(!VF.isScalable() && "scalable vectors not yet supported.");
PHINode *P = cast<PHINode>(PN);		PHINode *P = cast<PHINode>(PN);
if (EnableVPlanNativePath) {		if (EnableVPlanNativePath) {
// Currently we enter here in the VPlan-native path for non-induction		// Currently we enter here in the VPlan-native path for non-induction
// PHIs where all control flow is uniform. We simply widen these PHIs.		// PHIs where all control flow is uniform. We simply widen these PHIs.
// Create a vector phi with no operands - the vector phi operands will be		// Create a vector phi with no operands - the vector phi operands will be
// set at the end of vector code generation.		// set at the end of vector code generation.
Type *VecTy =		Type *VecTy =
(VF.isScalar()) ? PN->getType() : VectorType::get(PN->getType(), VF);		(VF.isScalar()) ? PN->getType() : VectorType::get(PN->getType(), VF);
▲ Show 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	case InductionDescriptor::IK_PtrInduction: {

if (Cost->isScalarAfterVectorization(P, VF)) {		if (Cost->isScalarAfterVectorization(P, VF)) {
// This is the normalized GEP that starts counting at zero.		// This is the normalized GEP that starts counting at zero.
Value *PtrInd =		Value *PtrInd =
Builder.CreateSExtOrTrunc(Induction, II.getStep()->getType());		Builder.CreateSExtOrTrunc(Induction, II.getStep()->getType());
// Determine the number of scalars we need to generate for each unroll		// Determine the number of scalars we need to generate for each unroll
// iteration. If the instruction is uniform, we only need to generate the		// iteration. If the instruction is uniform, we only need to generate the
// first lane. Otherwise, we generate all VF values.		// first lane. Otherwise, we generate all VF values.
unsigned Lanes =		unsigned Lanes =
CarolineConcatto (Not Done): So once we start to use scalable vectors and VF.getKnownMinValue(), shouldn't this be multiplied by getMaxVScale()?
david-arm (Not Done): This is for vectorisation of induction variables. I think we'll have to use a runtime VF that I introduced in D95139 here. I don't think Kerry has to fix this in her patch.
Cost->isUniformAfterVectorization(P, VF) ? 1 : VF.getKnownMinValue();		Cost->isUniformAfterVectorization(P, VF) ? 1 : VF.getKnownMinValue();
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
Constant *Idx = ConstantInt::get(PtrInd->getType(),		Constant *Idx = ConstantInt::get(PtrInd->getType(),
Lane + Part * VF.getKnownMinValue());		Lane + Part * VF.getKnownMinValue());
Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx);		Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx);
Value *SclrGep =		Value *SclrGep =
emitTransformedIndex(Builder, GlobalIdx, PSE.getSE(), DL, II);		emitTransformedIndex(Builder, GlobalIdx, PSE.getSE(), DL, II);
Show All 30 Lines	case InductionDescriptor::IK_PtrInduction: {
NewPointerPhi->addIncoming(InductionGEP, LoopLatch);		NewPointerPhi->addIncoming(InductionGEP, LoopLatch);

// Create UF many actual address geps that use the pointer		// Create UF many actual address geps that use the pointer
// phi as base and a vectorized version of the step value		// phi as base and a vectorized version of the step value
// (<step0, ..., stepN>) as offset.		// (<step0, ..., stepN>) as offset.
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
SmallVector<Constant *, 8> Indices;		SmallVector<Constant *, 8> Indices;
// Create a vector of consecutive numbers from zero to VF.		// Create a vector of consecutive numbers from zero to VF.
for (unsigned i = 0; i < VF.getKnownMinValue(); ++i)		for (unsigned i = 0; i < VF.getKnownMinValue(); ++i)
CarolineConcatto (Not Done): Same here, should we not need to multiply by getMaxVScale()?
Indices.push_back(		Indices.push_back(
ConstantInt::get(PhiType, i + Part * VF.getKnownMinValue()));		ConstantInt::get(PhiType, i + Part * VF.getKnownMinValue()));
Constant *StartOffset = ConstantVector::get(Indices);		Constant *StartOffset = ConstantVector::get(Indices);

Value *GEP = Builder.CreateGEP(		Value *GEP = Builder.CreateGEP(
ScStValueType->getPointerElementType(), NewPointerPhi,		ScStValueType->getPointerElementType(), NewPointerPhi,
Builder.CreateMul(		Builder.CreateMul(
StartOffset,		StartOffset,
▲ Show 20 Lines • Show All 892 Lines • ▼ Show 20 Lines	if (IgnoreScalableUserVF) {
});		});
}		}

// Beyond this point two scenarios are handled. If UserVF isn't specified		// Beyond this point two scenarios are handled. If UserVF isn't specified
// then a suitable VF is chosen. If UserVF is specified and there are		// then a suitable VF is chosen. If UserVF is specified and there are
// dependencies, check if it's legal. However, if a UserVF is specified and		// dependencies, check if it's legal. However, if a UserVF is specified and
// there are no dependencies, then there's nothing to do.		// there are no dependencies, then there's nothing to do.
if (UserVF.isNonZero() && !IgnoreScalableUserVF &&		if (UserVF.isNonZero() && !IgnoreScalableUserVF &&
Legal->isSafeForAnyVectorWidth())		Legal->isSafeForAnyVectorWidth())
		david-arm (Done): I think this looks much better now you're just checking reductions only once and early on, thanks for this! However, I think you might need to move this check down to line 5677 where we return UserVF. The reason is that if we have a loop that contains memory dependences and reductions in the same loop, we want to ensure we always do the reduction checks regardless. For example, Legal->isSafeForAnyVectorWidth() could return false and then in the code below we may successfully reduce the UserVF from <vscale x 8 x float> to <vscale x 4 x float> without ever calling canVectorizeReductions.
		fhahn (Done): Please also add a test for this scenario.
return UserVF;		return UserVF;

MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);		MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
unsigned SmallestType, WidestType;		unsigned SmallestType, WidestType;
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();		std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
unsigned WidestRegister = TTI.getRegisterBitWidth(true);		unsigned WidestRegister = TTI.getRegisterBitWidth(true);
		fhahn (Done): I think you should be able to use `reportVectorizationFailure` to print to `dbgs()` and generate a remark with the same message.

// Get the maximum safe dependence distance in bits computed by LAA.		// Get the maximum safe dependence distance in bits computed by LAA.
// It is computed by MaxVF * sizeOf(type) * 8, where type is taken from		// It is computed by MaxVF * sizeOf(type) * 8, where type is taken from
// the memory accesses that is most restrictive (involved in the smallest		// the memory accesses that is most restrictive (involved in the smallest
// dependence distance).		// dependence distance).
unsigned MaxSafeVectorWidthInBits = Legal->getMaxSafeVectorWidthInBits();		unsigned MaxSafeVectorWidthInBits = Legal->getMaxSafeVectorWidthInBits();

// If the user vectorization factor is legally unsafe, clamp it to a safe		// If the user vectorization factor is legally unsafe, clamp it to a safe
// value. Otherwise, return as is.		// value. Otherwise, return as is.
if (UserVF.isNonZero() && !IgnoreScalableUserVF) {		if (UserVF.isNonZero() && !IgnoreScalableUserVF) {
unsigned MaxSafeElements =		unsigned MaxSafeElements =
PowerOf2Floor(MaxSafeVectorWidthInBits / WidestType);		PowerOf2Floor(MaxSafeVectorWidthInBits / WidestType);
ElementCount MaxSafeVF = ElementCount::getFixed(MaxSafeElements);		ElementCount MaxSafeVF = ElementCount::getFixed(MaxSafeElements);

if (UserVF.isScalable()) {		if (UserVF.isScalable()) {
Optional<unsigned> MaxVScale = TTI.getMaxVScale();		Optional<unsigned> MaxVScale = TTI.getMaxVScale();

// Scale VF by vscale before checking if it's safe.		// Scale VF by vscale before checking if it's safe.
MaxSafeVF = ElementCount::getScalable(		MaxSafeVF = ElementCount::getScalable(
MaxVScale ? (MaxSafeElements / MaxVScale.getValue()) : 0);		MaxVScale ? (MaxSafeElements / MaxVScale.getValue()) : 0);

if (MaxSafeVF.isZero()) {		if (MaxSafeVF.isZero()) {
		david-arm (Done): I wonder if it's worth bailing out even earlier, i.e. in the same place as above where you check initially? I think the main benefit to bailing out here is if you can reduce the VF to something smaller so that it becomes legal. However, I think for reductions changing the VF won't make a difference in practice.
// The dependence distance is too small to use scalable vectors,		// The dependence distance is too small to use scalable vectors,
// fallback on fixed.		// fallback on fixed.
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Max legal vector width too small, scalable vectorization "		<< "LV: Max legal vector width too small, scalable vectorization "
"unfeasible. Using fixed-width vectorization instead.\n");		"unfeasible. Using fixed-width vectorization instead.\n");
ORE->emit([&]() {		ORE->emit([&]() {
return OptimizationRemarkAnalysis(DEBUG_TYPE, "ScalableVFUnfeasible",		return OptimizationRemarkAnalysis(DEBUG_TYPE, "ScalableVFUnfeasible",
TheLoop->getStartLoc(),		TheLoop->getStartLoc(),
		david-arm (Done): nit: Perhaps use "operations" here instead of types? I'm thinking that the user probably isn't aware of the RecurrenceKind so type might not make as much sense?
TheLoop->getHeader())		TheLoop->getHeader())
<< "Max legal vector width too small, scalable vectorization "		<< "Max legal vector width too small, scalable vectorization "
<< "unfeasible. Using fixed-width vectorization instead.";		<< "unfeasible. Using fixed-width vectorization instead.";
});		});
return computeFeasibleMaxVF(		return computeFeasibleMaxVF(
ConstTripCount, ElementCount::getFixed(UserVF.getKnownMinValue()));		ConstTripCount, ElementCount::getFixed(UserVF.getKnownMinValue()));
}		}
}		}
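The clamping arithmetic in the hunk above can be worked through with concrete numbers. The sketch below is standalone and hedged: `powerOf2FloorSketch` and `maxSafeScalableMinElts` are hypothetical helpers that mirror the expressions `PowerOf2Floor(MaxSafeVectorWidthInBits / WidestType)` and `MaxSafeElements / MaxVScale`, a `MaxVScale` of 0 stands in for the empty `Optional`, and the input values are assumptions chosen for illustration, not figures from the patch.

```cpp
// Largest power of two that is <= Value, or 0 when Value is 0.
constexpr unsigned powerOf2FloorSketch(unsigned Value) {
  unsigned Result = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= Value; Bit <<= 1)
    Result = Bit;
  return Result;
}

// Known-minimum element count of the largest safe scalable VF.
// A result of 0 corresponds to MaxSafeVF.isZero() in the diff, i.e. the
// case where the planner falls back to fixed-width vectorization.
constexpr unsigned maxSafeScalableMinElts(unsigned MaxSafeWidthInBits,
                                          unsigned WidestTypeBits,
                                          unsigned MaxVScale) {
  // MaxVScale == 0 models "the target did not report a maximum vscale".
  return MaxVScale == 0
             ? 0
             : powerOf2FloorSketch(MaxSafeWidthInBits / WidestTypeBits) /
                   MaxVScale;
}
```

For example, with an assumed safe width of 256 bits and a widest type of 32 bits, `MaxSafeElements` is 8. An assumed maximum vscale of 2 gives a scalable limit of `vscale x 4`, while a maximum vscale of 16 gives 0, which is the situation that triggers the "Max legal vector width too small" remark.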
▲ Show 20 Lines • Show All 328 Lines • ▼ Show 20 Lines	for (Instruction &I : BB->instructionsWithoutDebug()) {
MaxWidth = std::max(MaxWidth,		MaxWidth = std::max(MaxWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
}		}
}		}

return {MinWidth, MaxWidth};		return {MinWidth, MaxWidth};
}		}

unsigned LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,		unsigned LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
		Lint: Pre-merge checks: clang-format: please reformat the code: put the `unsigned` return type on its own line, followed by `LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,` and `InstructionCost LoopCost) {`.
unsigned LoopCost) {		InstructionCost LoopCost) {
// -- The interleave heuristics --		// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.		// We interleave the loop in order to expose ILP and reduce the loop overhead.
// There are many micro-architectural considerations that we can't predict		// There are many micro-architectural considerations that we can't predict
// at this level. For example, frontend pressure (on decode or fetch) due to		// at this level. For example, frontend pressure (on decode or fetch) due to
// code size, or the number and capabilities of the execution ports.		// code size, or the number and capabilities of the execution ports.
//		//
// We use the following heuristics to select the interleave count:		// We use the following heuristics to select the interleave count:
// 1. If the code has reductions, then we interleave to break the cross		// 1. If the code has reductions, then we interleave to break the cross
[... 114 unchanged lines hidden; context: unsigned LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF, ...]

// If we did not calculate the cost for VF (because the user selected the VF)		// If we did not calculate the cost for VF (because the user selected the VF)
// then we calculate the cost of VF here.		// then we calculate the cost of VF here.
if (LoopCost == 0) {		if (LoopCost == 0) {
assert(expectedCost(VF).first.isValid() && "Expected a valid cost");		assert(expectedCost(VF).first.isValid() && "Expected a valid cost");
LoopCost = *expectedCost(VF).first.getValue();		LoopCost = *expectedCost(VF).first.getValue();
}		}

assert(LoopCost && "Non-zero loop cost expected");		assert(LoopCost.getValue() && "Non-zero loop cost expected");
		CarolineConcatto (Done): I believe we can use LoopCost.isValid() here!
		david-arm (Done): I think since we're changing LoopCost to be InstructionCost we can change the line above too, from `LoopCost = *expectedCost(VF).first.getValue();` to `LoopCost = expectedCost(VF).first;`

// Interleave if we vectorized this loop and there is a reduction that could		// Interleave if we vectorized this loop and there is a reduction that could
// benefit from interleaving.		// benefit from interleaving.
if (VF.isVector() && HasReductions) {		if (VF.isVector() && HasReductions) {
LLVM_DEBUG(dbgs() << "LV: Interleaving because of reductions.\n");		LLVM_DEBUG(dbgs() << "LV: Interleaving because of reductions.\n");
return IC;		return IC;
}		}

// Note that if we've already vectorized the loop we will have done the		// Note that if we've already vectorized the loop we will have done the
// runtime check and so interleaving won't require further checks.		// runtime check and so interleaving won't require further checks.
bool InterleavingRequiresRuntimePointerCheck =		bool InterleavingRequiresRuntimePointerCheck =
(VF.isScalar() && Legal->getRuntimePointerChecking()->Need);		(VF.isScalar() && Legal->getRuntimePointerChecking()->Need);

// We want to interleave small loops in order to reduce the loop overhead and		// We want to interleave small loops in order to reduce the loop overhead and
// potentially expose ILP opportunities.		// potentially expose ILP opportunities.
LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'		LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'
<< "LV: IC is " << IC << '\n'		<< "LV: IC is " << IC << '\n'
<< "LV: VF is " << VF << '\n');		<< "LV: VF is " << VF << '\n');
const bool AggressivelyInterleaveReductions =		const bool AggressivelyInterleaveReductions =
TTI.enableAggressiveInterleaving(HasReductions);		TTI.enableAggressiveInterleaving(HasReductions);
if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) {		if (!InterleavingRequiresRuntimePointerCheck && (unsigned)*LoopCost.getValue() < SmallLoopCost) {
		Lint (pre-merge checks): clang-format: please reformat the code
		-  if (!InterleavingRequiresRuntimePointerCheck && (unsigned)*LoopCost.getValue() < SmallLoopCost) {
		+  if (!InterleavingRequiresRuntimePointerCheck &&
		+      (unsigned)*LoopCost.getValue() < SmallLoopCost) {
		CarolineConcatto (Not Done): Can you change SmallLoopCost to be an InstructionCost like LoopCost, so you don't need to use *LoopCost.getValue()? I believe you then won't need getValue in the std::min either.
		kmclaughlin (Author, Done): Hi @CarolineConcatto, thanks for your suggestions on InstructionCost! I didn't change the SmallLoopCost flag to be an instruction cost in the last revision, as this caused tests which use -small-loop-cost to fail (e.g. LoopVectorize/unroll_novec.ll).
// We assume that the cost overhead is 1 and we use the cost model		// We assume that the cost overhead is 1 and we use the cost model
// to estimate the cost of the loop and interleave until the cost of the		// to estimate the cost of the loop and interleave until the cost of the
// loop overhead is about 5% of the cost of the loop.		// loop overhead is about 5% of the cost of the loop.
unsigned SmallIC =		unsigned SmallIC =
		Lint (pre-merge checks): clang-format: please reformat the code
		-    unsigned SmallIC =
		-      std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / *(LoopCost.getValue())));
		+    unsigned SmallIC = std::min(
		+        IC, (unsigned)PowerOf2Floor(SmallLoopCost / *(LoopCost.getValue())));
std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost));		std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / *(LoopCost.getValue())));

// Interleave until store/load ports (estimated by max interleave count) are		// Interleave until store/load ports (estimated by max interleave count) are
// saturated.		// saturated.
unsigned NumStores = Legal->getNumStores();		unsigned NumStores = Legal->getNumStores();
unsigned NumLoads = Legal->getNumLoads();		unsigned NumLoads = Legal->getNumLoads();
unsigned StoresIC = IC / (NumStores ? NumStores : 1);		unsigned StoresIC = IC / (NumStores ? NumStores : 1);
unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1);		unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1);
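
The arithmetic in the small-loop branch above can be tried outside LLVM. The following is a minimal sketch, assuming plain `unsigned` costs rather than `InstructionCost`; `powerOf2Floor`, `smallLoopInterleaveCount` and `memOpInterleaveCount` are hypothetical names chosen for illustration and are not LLVM APIs.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: largest power of two that is <= X (0 when X == 0).
// Stands in for the PowerOf2Floor call in the diff.
unsigned powerOf2Floor(unsigned X) {
  unsigned P = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= X; Bit <<= 1)
    P = Bit;
  return P;
}

// Interleave count for a small loop: SmallLoopCost / LoopCost rounded down to
// a power of two, capped by the maximum interleave count IC.
unsigned smallLoopInterleaveCount(unsigned IC, unsigned SmallLoopCost,
                                  unsigned LoopCost) {
  assert(LoopCost != 0 && "Non-zero loop cost expected");
  return std::min(IC, powerOf2Floor(SmallLoopCost / LoopCost));
}

// Interleave count limited by the number of loads (or stores) in the loop;
// a loop with no such operations keeps the full IC.
unsigned memOpInterleaveCount(unsigned IC, unsigned NumMemOps) {
  return IC / (NumMemOps ? NumMemOps : 1);
}
```

With `IC = 8`, `SmallLoopCost = 20` and `LoopCost = 3`, the quotient is 6, which rounds down to 4, so the sketch returns 4. In the diff this computation is only reached when `LoopCost < SmallLoopCost`, so the quotient there is at least 1.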

[... 1,445 unchanged lines hidden; context: LLVM_DEBUG( ...]
"which requires masked-interleaved support.\n");		"which requires masked-interleaved support.\n");
if (CM.InterleaveInfo.invalidateGroups())		if (CM.InterleaveInfo.invalidateGroups())
// Invalidating interleave groups also requires invalidating all decisions		// Invalidating interleave groups also requires invalidating all decisions
// based on them, which includes widening decisions and uniform and scalar		// based on them, which includes widening decisions and uniform and scalar
// values.		// values.
CM.invalidateCostModelingDecisions();		CM.invalidateCostModelingDecisions();
}		}

		if (!CM.isLegalWideningOperation(UserVF))
		fhahn (Not Done): This should only be checked in the code handling `UserVF` below? Also, this seems like a property that generally limits the vectorization factor to fixed-width vectorization factors, and it would be good to check it beforehand. Would it be possible to just limit vectorization factors to fixed-width factors in `computeFeasibleMaxVF`? This way, we won't need extra checks once automatically picked VFs are supported. You also won't need any extra code in the caller of `::plan`. This is similar to how we deal with other 'legality' properties that depend on the vectorization factor, like dependencies that may limit the vectorization factor.
		kmclaughlin (Author, Done): Thanks for this suggestion, @fhahn. I've moved the canVectorizeReductions check to `computeFeasibleMaxVF` & updated the affected test in scalable_reductions.ll, where we can use fixed-width vectorization instead (`@mul`).
		return {{UserVF, InstructionCost::getInvalid()}};

ElementCount MaxVF = MaybeMaxVF.getValue();		ElementCount MaxVF = MaybeMaxVF.getValue();
assert(MaxVF.isNonZero() && "MaxVF is zero.");		assert(MaxVF.isNonZero() && "MaxVF is zero.");

bool UserVFIsLegal = ElementCount::isKnownLE(UserVF, MaxVF);		bool UserVFIsLegal = ElementCount::isKnownLE(UserVF, MaxVF);
if (!UserVF.isZero() &&		if (!UserVF.isZero() &&
(UserVFIsLegal || (UserVF.isScalable() && MaxVF.isScalable()))) {		(UserVFIsLegal || (UserVF.isScalable() && MaxVF.isScalable()))) {
// FIXME: MaxVF is temporarily used inplace of UserVF for illegal scalable		// FIXME: MaxVF is temporarily used inplace of UserVF for illegal scalable
// VFs here, this should be reverted to only use legal UserVFs once the		// VFs here, this should be reverted to only use legal UserVFs once the
// loop below supports scalable VFs.		// loop below supports scalable VFs.

		CarolineConcatto (Done): nit
ElementCount VF = UserVFIsLegal ? UserVF : MaxVF;		ElementCount VF = UserVFIsLegal ? UserVF : MaxVF;
LLVM_DEBUG(dbgs() << "LV: Using " << (UserVFIsLegal ? "user" : "max")		LLVM_DEBUG(dbgs() << "LV: Using " << (UserVFIsLegal ? "user" : "max")
<< " VF " << VF << ".\n");		<< " VF " << VF << ".\n");
assert(isPowerOf2_32(VF.getKnownMinValue()) &&		assert(isPowerOf2_32(VF.getKnownMinValue()) &&
"VF needs to be a power of two");		"VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
CM.selectUserVectorizationFactor(VF);		CM.selectUserVectorizationFactor(VF);
[... 1,746 unchanged lines hidden; context: #endif /* NDEBUG */ ...]
Optional<VectorizationFactor> MaybeVF = LVP.plan(UserVF, UserIC);		Optional<VectorizationFactor> MaybeVF = LVP.plan(UserVF, UserIC);

VectorizationFactor VF = VectorizationFactor::Disabled();		VectorizationFactor VF = VectorizationFactor::Disabled();
unsigned IC = 1;		unsigned IC = 1;

if (MaybeVF) {		if (MaybeVF) {
VF = *MaybeVF;		VF = *MaybeVF;
// Select the interleave count.		// Select the interleave count.
		if (VF.Cost.isValid())
IC = CM.selectInterleaveCount(VF.Width, VF.Cost);		IC = CM.selectInterleaveCount(VF.Width, VF.Cost);
		CarolineConcatto (Done): nit
}		}

// Identify the diagnostic messages that should be produced.		// Identify the diagnostic messages that should be produced.
std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;		std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
bool VectorizeLoop = true, InterleaveLoop = true;		bool VectorizeLoop = true, InterleaveLoop = true;
if (Requirements.doesNotMeet(F, L, Hints)) {		if (Requirements.doesNotMeet(F, L, Hints)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "
"requirements.\n");		"requirements.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

		if (!VF.Cost.isValid()) {
		LLVM_DEBUG(dbgs() << "LV: Not vectorizing: The cost-model indicates that "
		fhahn (Done): This message seems a bit odd. I think the cost model should just be responsible for assigning a cost, not deciding whether it is possible to vectorize or not; that's the job of the legality checks. Please see my comment above; this could probably be done in `computeFeasibleMaxVF`, which technically is part of the cost model, but is the first step and applies other legality constraints as well which limit the vectorization factor.
		"vectorization is not possible.\n");
		VectorizeLoop = false;
		david-arm (Not Done): Similar to an earlier comment, a remark here would be good I think.
		return false;
		}

if (VF.Width.isScalar()) {		if (VF.Width.isScalar()) {
LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");		LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
VecDiagMsg = std::make_pair(		VecDiagMsg = std::make_pair(
"VectorizationNotBeneficial",		"VectorizationNotBeneficial",
"the cost-model indicates that vectorization is not beneficial");		"the cost-model indicates that vectorization is not beneficial");
VectorizeLoop = false;		VectorizeLoop = false;
}		}
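
The validity checks in this part of the diff follow a simple pattern: an invalid cost means the loop is not vectorized, and a scalar width means vectorization was possible but not worthwhile. A rough sketch of that decision order is given below, assuming a toy `ToyCost` type instead of `InstructionCost`; `ToyFactor`, `Decision` and `classify` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for InstructionCost: a value plus a validity flag.
struct ToyCost {
  bool Valid;
  unsigned Value;
  static ToyCost getInvalid() { return {false, 0}; }
};

// Hypothetical stand-in for VectorizationFactor (width of 1 means scalar).
struct ToyFactor {
  unsigned Width;
  ToyCost Cost;
};

enum class Decision { NotPossible, NotBeneficial, Vectorize };

// Mirrors the order of the checks in the diff: validity first, then width.
Decision classify(const ToyFactor &VF) {
  if (!VF.Cost.Valid)
    return Decision::NotPossible;
  if (VF.Width == 1)
    return Decision::NotBeneficial;
  return Decision::Vectorize;
}
```

This reflects only the revision shown here; the review comments suggest moving the legality decision into `computeFeasibleMaxVF` instead of signalling it through an invalid cost.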

[... 278 unchanged lines hidden ...]

llvm/test/Transforms/LoopVectorize/scalable_reductions.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -transform-warning -mtriple aarch64-unknown-linux-gnu -mattr=+sve -debug-only=loop-vectorize -S 2>&1 | FileCheck %s -check-prefix=CHECK
				david-arm (Done): Needs a "REQUIRES: asserts" here I think, because you're relying upon debug output. Also, since you're explicitly adding "-mattr=+sve" here I think you'll either have to:
				1. Make the test generic so it works for all targets (this test will fail on some builds due to lack of AArch64 support), or
				2. Move the test to LoopVectorize/AArch64.
				kmclaughlin (Author, Done): Added `REQUIRES: asserts` & moved the test to `Transforms/LoopVectorize/AArch64`.
				; RUN: opt < %s -loop-vectorize -transform-warning -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S 2>&1 | FileCheck %s -check-prefix=CHECK-WARN

				; Reduction can be vectorized

				; ADD

				; int sum = 0;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				fhahn (Done): Personally I don't think the C source code adds much value. The IR is very compact and it should be obvious from the IR & test name what is going on. Also, the IR that clang generates can change, clang options may change, pragmas may change and so on.
				; for (int i = 0; i < n; ++i)
				; sum += a[i];
				; return sum;

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				david-arm (Done): I wonder if it's worth adding CHECK lines for the resulting IR to show we've vectorised the loop using reductions and checking we have the right structure, i.e. vector.body, middle.block, etc?
				define dso_local i32 @add(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end
				fhahn (Not Done): This should not be needed for the test.

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				fhahn (Done): This should not be needed for the test; you can just pass `%n` as `i64`.
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				fhahn (Done): nit: can strip `indvars` from the name to make things more compact.
				%sum.07 = phi i32 [ 2, %for.body.preheader ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi i32 [ 2, %entry ], [ %add, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; OR

				; int foo(int * __restrict__ a, int * __restrict__ inv, int b, int c, int n) {
				; int sum = 0;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum |= a[i];
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local i32 @or(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi i32 [ 2, %for.body.preheader ], [ %or, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%or = or i32 %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi i32 [ 2, %entry ], [ %or, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; AND

				; int foo(int * __restrict__ a, int * __restrict__ inv, int b, int c, int n) {
				; int sum = 0;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum &= a[i];
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local i32 @and(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi i32 [ 2, %for.body.preheader ], [ %and, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%and = and i32 %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi i32 [ 2, %entry ], [ %and, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; XOR

				; int sum = 0;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum ^= a[i];
				; return sum;

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local i32 @xor(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi i32 [ 2, %for.body.preheader ], [ %xor, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%xor = xor i32 %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi i32 [ 2, %entry ], [ %xor, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; SMIN

				; int foo(int * __restrict__ a, int * __restrict__ inv, int b, int c, int n) {
				; int sum = 1;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum = std::min(sum, a[i]);
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local i32 @smin(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.010 = phi i32 [ 2, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%cmp.i = icmp slt i32 %0, %sum.010
				%.sroa.speculated = select i1 %cmp.i, i32 %0, i32 %sum.010
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi i32 [ 1, %entry ], [ %.sroa.speculated, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; UMAX

				; unsigned foo(unsigned * __restrict__ a, int n) {
				; unsigned sum = 1;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum = std::max(sum, a[i]);
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local i32 @umax(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.010 = phi i32 [ 2, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%cmp.i = icmp ugt i32 %0, %sum.010
				%.sroa.speculated = select i1 %cmp.i, i32 %0, i32 %sum.010
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi i32 [ 1, %entry ], [ %.sroa.speculated, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; FADD

				; float foo(float * __restrict__ a, int n) {
				; float sum = 0;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum += a[i];
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local float @fadd(float* noalias nocapture readonly %a, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%add = fadd float %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				ret float %sum.0.lcssa
				}

				; FADD (FAST)

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local float @fadd_fast(float* noalias nocapture readonly %a, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%add = fadd fast float %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				ret float %sum.0.lcssa
				}

				; FMIN (FAST)

				; float foo(float * __restrict__ a, int n) {
				; float sum = 2;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum = std::min(sum, a[i]);
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local float @fmin_fast(float* noalias nocapture readonly %a, i32 %n) #0 {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.i = fcmp fast olt float %0, %sum.07
				%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
				ret float %sum.0.lcssa
				}

				; FMAX (FAST)

				; float foo(float * __restrict__ a, int n) {
				; float sum = 2;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum = std::max(sum, a[i]);
				; return sum;
				; }

				; CHECK: LV: Found a vectorizable loop (vscale x 8)
				; CHECK: LV: Interleave Count is 2
				; CHECK: Setting best plan to VF=vscale x 8, UF=2
				define dso_local float @fmax_fast(float* noalias nocapture readonly %a, i32 %n) #0 {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.i = fcmp fast ogt float %0, %sum.07
				%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
				ret float %sum.0.lcssa
				}

				; Reduction cannot be vectorized

				; MUL

				; int sum = 2;
				; #pragma clang loop vectorize_width(8, scalable) interleave_count(2)
				; for (int i = 0; i < n; ++i)
				; sum *= a[i];
				; return sum;

				; CHECK-WARN: warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
				define dso_local i32 @mul(i32* nocapture %a, i32* nocapture readonly %b, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.body.preheader, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi i32 [ 2, %for.body.preheader ], [ %mul, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%mul = mul nsw i32 %0, %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi i32 [ 2, %entry ], [ %mul, %for.body ]
				ret i32 %sum.0.lcssa
				}

				; FMIN

				; CHECK-WARN: warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
				define dso_local float @fmin(float* noalias nocapture readonly %a, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.i = fcmp olt float %0, %sum.07
				%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
				ret float %sum.0.lcssa
				}

				; FMAX

				; CHECK-WARN: warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
				define dso_local float @fmax(float* noalias nocapture readonly %a, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.end

				for.body.preheader:
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %.sroa.speculated, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.i = fcmp ogt float %0, %sum.07
				%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

				for.end:
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
				ret float %sum.0.lcssa
				}

				attributes #0 = { "no-nans-fp-math"="true" }

				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.vectorize.width", i32 8}
				!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!3 = !{!"llvm.loop.interleave.count", i32 2}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}
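
For readers less familiar with what the vectorizer does to loops like `@add`, the following is a scalar C++ model of an interleaved add reduction: `VF * UF` partial sums are accumulated in the main loop, combined once afterwards, and a scalar remainder loop handles the tail. It is an illustration only and not the IR the pass emits; a fixed `VF` stands in for the runtime `vscale x 8` lane count, and `reduceAddModel` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a vectorized, interleaved add reduction.
// VF models the number of lanes, UF the interleave count.
uint32_t reduceAddModel(const std::vector<uint32_t> &A, uint32_t Start,
                        std::size_t VF, std::size_t UF) {
  assert(VF > 0 && UF > 0 && "need at least one lane and one part");
  const std::size_t Step = VF * UF;
  // One accumulator per lane per interleaved part, seeded with the identity
  // of the reduction (0 for add).
  std::vector<uint32_t> Partial(Step, 0);
  std::size_t I = 0;
  // Main loop: each iteration consumes VF * UF elements.
  for (; I + Step <= A.size(); I += Step)
    for (std::size_t L = 0; L < Step; ++L)
      Partial[L] += A[I + L];
  // After the main loop: fold the partial sums into the start value.
  uint32_t Sum = Start;
  for (uint32_t P : Partial)
    Sum += P;
  // Scalar remainder loop for the leftover elements.
  for (; I < A.size(); ++I)
    Sum += A[I];
  return Sum;
}
```

Because the additions are reassociated, the result is only guaranteed to match the original loop order for integer arithmetic. The start value of 2 used in the tests above corresponds to the `Start` parameter here.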