This is an archive of the discontinued LLVM Phabricator instance.

Differential D101836

[LoopVectorize] Enable strict reductions when allowReordering() returns false
ClosedPublic

Authored by kmclaughlin on May 4 2021, 7:38 AM.

Details

Reviewers

sdesmalen
david-arm
dmgreen
fhahn
peterwaller-arm
spatel

Commits

rG9f76a8526010: [LoopVectorize] Enable strict reductions when allowReordering() returns false

Summary

When loop hints are passed via metadata, the allowReordering function
in LoopVectorizationLegality will allow the order of floating point
operations to be changed:

bool allowReordering() const {
  // When enabling loop hints are provided we allow the vectorizer to change
  // the order of operations that is given by the scalar loop. This is not
  // enabled by default because can be unsafe or inefficient.

The -enable-strict-reductions flag introduced in D98435 will currently only
vectorize reductions in-loop if hints are used, since canVectorizeFPMath()
will return false if reordering is not allowed.

This patch changes canVectorizeFPMath() to query whether it is safe to
vectorize the loop with ordered reductions if no hints are used. For
testing purposes, an additional flag (-hints-allow-reordering) has been
added to disable the reordering behaviour described above.
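
To illustrate why reordering matters, the following is a minimal sketch (not taken from the patch) that emulates three ways of summing a float array: the scalar loop, an ordered (strict, in-loop) reduction that consumes elements in chunks of VF but still adds them in source order, and an unordered reduction that keeps one partial sum per lane. The function names and the scalar emulation of vector lanes are assumptions made for illustration only; the vectorizer itself emits vector instructions rather than code like this. VF is assumed to be at least 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: the order of additions given by the original loop.
float sumScalar(const std::vector<float> &Data) {
  float Acc = 0.0f;
  for (float X : Data)
    Acc += X;
  return Acc;
}

// Ordered (strict) reduction: elements are consumed VF at a time, but each
// lane is added to the single accumulator in source order, so the sequence
// of floating-point additions matches the scalar loop exactly.
float sumOrdered(const std::vector<float> &Data, std::size_t VF) {
  float Acc = 0.0f;
  std::size_t I = 0;
  for (; I + VF <= Data.size(); I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Acc += Data[I + Lane];
  for (; I < Data.size(); ++I) // scalar remainder
    Acc += Data[I];
  return Acc;
}

// Unordered reduction: one partial sum per lane, combined after the loop.
// This reassociates the additions and may round differently.
float sumUnordered(const std::vector<float> &Data, std::size_t VF) {
  std::vector<float> Partial(VF, 0.0f);
  std::size_t I = 0;
  for (; I + VF <= Data.size(); I += VF)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Partial[Lane] += Data[I + Lane];
  float Acc = 0.0f;
  for (float P : Partial)
    Acc += P;
  for (; I < Data.size(); ++I) // scalar remainder
    Acc += Data[I];
  return Acc;
}
```

Assuming IEEE-754 single precision and no fast-math style compiler flags, the input {1e20f, 1.0f, -1e20f, 1.0f} with VF = 2 gives 1.0f for both the scalar and the ordered sum, and 2.0f for the unordered sum, because the small terms are absorbed at different points.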

Diff Detail

Event Timeline

kmclaughlin created this revision. May 4 2021, 7:38 AM

Herald added a subscriber: hiraditya. May 4 2021, 7:38 AM

kmclaughlin requested review of this revision. May 4 2021, 7:38 AM

Herald added a project: Restricted Project. May 4 2021, 7:38 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B102525: Diff 342733. May 4 2021, 8:19 AM

sdesmalen added inline comments. May 4 2021, 1:38 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9804–9805	nit: indentation, please use clang-format.
9810	Why do all of the reductions have to be ordered for the LV to be able to vectorize FP math? (e.g. if there is an integer reduction and an ordered FP reduction, it would now choose not to vectorize based on this condition)
9919–9920	This condition is a bit odd. Should `canVectorizeOrderedFPMath` just contain the call to `Requirements.canVectorizeFPMath` instead? i.e. in order to vectorize ordered FP math, it must at least be able to vectorize FP math.

david-arm added inline comments. May 5 2021, 12:49 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9810	I guess it might be worth adding a test for this too then, i.e. having a loop with both an integer and FP reduction and ensure we vectorise with ordered reductions.
9919–9920	I think the existing canVectorizeFPMath function is badly named because it actually checks for reordering:

    bool canVectorizeFPMath(const LoopVectorizeHints &Hints) const {
      return !ExactFPMathInst || Hints.allowReordering();
    }

so the logic in Kerry's patch is something like this:

    1. Is this an exact FP math instruction? If not -> vectorise, else
    2. Do hints permit reordering? If so -> vectorise, else
    3. Can we vectorise with ordered reductions? If not -> emit remark.

It probably is possible to combine these into a single LoopVectorizationLegality::canVectorizeFPMath function that does all the above, since that class does have access to the Requirements I think.

Addressing review comments from @sdesmalen & @david-arm:

Merged canVectorizeFPMath with canVectorizeOrderedFPMath in LoopVectorizationLegality
Only check the IsOrdered flag of the RecurrenceDescriptor if hasExactFPMath() is true.
Added a test with different types of reductions (integer add & FP add) that we should be able to vectorize with the -enable-strict-reductions flag

Also added an extra RUN line to both strict-fadd.ll & scalable-strict-fadd.ll to test the changes made to allowReordering (i.e. changing EC.getKnownMinValue() > 1 to EC.isScalar()).

kmclaughlin added inline comments. May 5 2021, 7:50 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9810	Hi @sdesmalen, we should only need the FP reductions in the loop to be ordered. I've changed this so that only reductions where `hasExactFPMath()` is true need to be ordered & added a test for this scenario to strict-fadd.ll

david-arm added inline comments. May 5 2021, 7:57 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
258	I think we can avoid passing in the Hints here as they are already a member of the class with the same name?

sdesmalen added inline comments. May 5 2021, 8:06 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
146	is this change necessary?
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
865	Should a hint of VF=1 really lead to the diagnostic `"loop not vectorized: cannot prove it is safe to reorder floating-point operations"`?
865–866	This will also return false if there are no reductions at all, or if all reductions are unordered?
868–871	How about returning true if, for each reduction variable, any of the following conditions is true:

    - There is no ExactFPMath instruction for the reduction.
    - The reduction is unordered.
    - EnableStrictReductions is true.
872	I'd prefer the default case to be `return false;`, i.e. when we cannot explicitly determine it is safe, we assume it isn't safe. That would handle the case where `Requirements->getExactFPInst()` is true, but it isn't an instruction used in the reduction. (although I don't know if that would ever happen?)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9919	You don't need to pass EnableStrictReductions, since it is defined in the same file?

david-arm added inline comments. May 5 2021, 8:10 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
16	Thanks a lot for adding the VF=vscale x 1 case here, but perhaps `CHECK-SCALAR` should be `CHECK-VF1U1`, since we're still vectorising? Also, it's probably worth adding an extra CHECK line for at least one instruction that shows the "<vscale x 1 ..." - maybe the `call float ...` instruction?

david-arm added inline comments. May 5 2021, 8:17 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
146	It's fixing a missing case where we weren't previously allowing reordering for scalable VF=vscale x 1. I think it's worth fixing, but maybe it doesn't have to live in this patch?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9919	I think it has to be passed since `EnableStrictReductions` lives in LoopVectorize.cpp and canVectorizeFPMath lives in LoopVectorizationLegality.cpp.

sdesmalen added inline comments. May 5 2021, 8:30 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
146	@kmclaughlin can this be pulled out into a separate patch, or does it depend on changes in this patch in order to test it? I find the way the condition is written very confusing. It looks like the condition is synonymous, but it isn't. How about writing `EC.isVector()` instead?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9919	You're right, I didn't realise that, thanks!
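
The difference between `EC.getKnownMinValue() > 1` and `EC.isVector()` discussed above only shows up for a scalable width of vscale x 1. The sketch below models this with a simplified stand-in for `llvm::ElementCount`; the struct `ElementCountModel` and the two helper functions are hypothetical names introduced for illustration, and the predicates are written from the behaviour described in this thread rather than copied from LLVM.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for llvm::ElementCount.
struct ElementCountModel {
  unsigned MinVal; // known minimum number of elements
  bool Scalable;   // true for a width of "vscale x MinVal"

  unsigned getKnownMinValue() const { return MinVal; }
  // Exactly one element, and not scalable.
  bool isScalar() const { return !Scalable && MinVal == 1; }
  // Any scalable width with a non-zero minimum, or a fixed width above 1.
  bool isVector() const { return (Scalable && MinVal != 0) || MinVal > 1; }
};

// The width test as written in allowReordering() in this diff.
bool widthCheckCurrent(const ElementCountModel &EC) {
  return EC.getKnownMinValue() > 1;
}

// The spelling suggested in the review.
bool widthCheckSuggested(const ElementCountModel &EC) {
  return EC.isVector();
}
```

In this model the two checks agree for fixed widths, but for vscale x 1 the current check is false while the suggested one is true, which is the missing case mentioned in the comments.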

Harbormaster completed remote builds in B102745: Diff 343044. May 5 2021, 8:39 AM

Removed changes to allowReordering() from this patch
Removed Hints.getWidth().isScalar() check from canVectorizeFPMath()
Changed canVectorizeFPMath to also look at induction variables, as we should not vectorize if the loop has any exact floating-point induction operators and we do not allow reassociation.
Added more tests to strict-fadd.ll which include floating-point induction variables to test the above changes.

kmclaughlin added inline comments. May 10 2021, 8:44 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
146	@sdesmalen this doesn't depend on any other changes here so I've removed it from this patch
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
865	I've removed this check as I don't think it's necessary, but it was added to be consistent with allowReordering() which returns false if `EC.getKnownMinValue() > 1`

Harbormaster completed remote builds in B103497: Diff 344065. May 10 2021, 8:55 AM

sdesmalen added inline comments. May 10 2021, 9:28 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
873–879	is it sufficient to write `if (any_of(..... { })) return false;` (i.e. if `ExactIndVars` is true, then `!getInductionVars().empty()` must also be true)
884	Is it still necessary to iterate through the reduction variables at this point? Given that EnableStrictReductions is true, and that reductions are the only other operations that can have exact FPMath instructions, I think you can just return `true`.

david-arm added inline comments. May 11 2021, 12:32 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
884	We don't support vectorising all of these reductions, for example we don't support strict reductions involving `fmul` and we don't support chains of `fadds` currently either. That's why in the code below we check `!RdxDesc.isOrdered()` because the ordered flag is only set for cases we can support at the moment. I think @kmclaughlin has added a test for this case as well below called `fast_induction_unordered_reduction` which shows how there are both fmul and fadd reductions in the same loop.

Removed the !getInductionVars().empty() test from canVectorizeFPMath()

david-arm added inline comments. May 12 2021, 6:13 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
887	nit: This is just a suggestion, but you could rename `ExactRdxVars` to `HasExactRdxVar` and then here simply do `return !HasExactRdxVar;` since I think when the list is empty that variable should be false?

Harbormaster completed remote builds in B104031: Diff 344786. May 12 2021, 6:26 AM

Removed the getReductionVars().empty() test from canVectorizeFPMath() and renamed ExactRdxVar to HasExactRdxVar

Harbormaster completed remote builds in B104076: Diff 344868. May 12 2021, 11:27 AM

sdesmalen added inline comments. May 14 2021, 9:01 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
882	Can you rewrite this to:

    // We can now only vectorize if all reductions with Exact FP math also
    // have the isOrdered flag set, which indicates that we can move the
    // reduction operations in-loop.
    return all_of(getReductionVars(), [&](auto &Reduction) -> bool {
      RecurrenceDescriptor RdxDesc = Reduction.second;
      return !RdxDesc.hasExactFPMath() || RdxDesc.isOrdered();
    });
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
7–17	Should all test functions have check lines for all VF8UF1, VF8UF4, etc. ? Conversely, is it sufficient to just pass the interleave-count hint (not the vector width) via metadata and have 1 RUN line for VF8UF1, VF8UF4, VF4UF1? Which also makes me wonder, what is the additional value of having both VF8UF1 and VF4UF1 ?

Removes HasExactRdxVar from canVectorizeFPMath() and instead returns the result from all_of(getReductionVars()...
Reduce the number of RUN lines in the tests

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
7–17	I think it should be sufficient to pass the interleave count via metadata. I've changed VF4UF1 to VF8UF1 as there was no additional benefit in having both, similarly I've changed VF4UF1 in the strict-fadd.ll test as well to reduce the number of RUN lines.
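
Putting the checks quoted in this thread together, the decision made by `canVectorizeFPMath()` can be sketched roughly as below. This is a standalone model under stated assumptions, not the committed code: `LoopModel`, `ReductionModel`, `InductionModel` and `canVectorizeFPMathModel` are hypothetical names, and plain booleans stand in for the `Requirements`, `Hints`, `InductionDescriptor` and `RecurrenceDescriptor` queries.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the descriptors queried by the real function.
struct ReductionModel {
  bool HasExactFPMath; // models RdxDesc.hasExactFPMath()
  bool IsOrdered;      // models RdxDesc.isOrdered()
};
struct InductionModel {
  bool HasExactFPMath; // models IndDesc.getExactFPMathInst() != nullptr
};
struct LoopModel {
  bool HasExactFPInst;       // models Requirements->getExactFPInst()
  bool HintsAllowReordering; // models Hints->allowReordering()
  std::vector<InductionModel> Inductions;
  std::vector<ReductionModel> Reductions;
};

bool canVectorizeFPMathModel(const LoopModel &L, bool EnableStrictReductions) {
  // No exact FP math, or reordering is permitted: nothing more to prove.
  if (!L.HasExactFPInst || L.HintsAllowReordering)
    return true;

  // Exact FP inductions cannot be vectorized without reordering, and
  // ordered reductions are only considered when the flag is set.
  if (!EnableStrictReductions ||
      std::any_of(L.Inductions.begin(), L.Inductions.end(),
                  [](const InductionModel &I) { return I.HasExactFPMath; }))
    return false;

  // Every reduction with exact FP math must be one that can be performed
  // in-loop (ordered); other reductions, e.g. integer adds, are unaffected.
  return std::all_of(L.Reductions.begin(), L.Reductions.end(),
                     [](const ReductionModel &R) {
                       return !R.HasExactFPMath || R.IsOrdered;
                     });
}
```

With this shape, a loop mixing an integer add reduction and an ordered FP add reduction is accepted when strict reductions are enabled, while an exact FP reduction that is not marked ordered (such as the `fmul` case mentioned earlier) is rejected.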

david-arm added inline comments. May 17 2021, 6:18 AM

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
82–104	Hi @kmclaughlin, I think maybe this is meant to be CHECK-VF8UF4?

Harbormaster completed remote builds in B104805: Diff 345849. May 17 2021, 6:49 AM

Fixed incorrect CHECK lines in the @fadd_strict_unroll_last_val test in strict-fadd.ll (CHECK-VF8UF2 -> CHECK-VF8UF4)

Harbormaster completed remote builds in B104818: Diff 345867. May 17 2021, 8:03 AM

sdesmalen added inline comments. May 17 2021, 1:32 PM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
76	Why does this test need to be vectorized with VF=vscale x 8 instead of VF=vscale x 4? Is that because it needs to be driven using the cmdline flags to circumvent the "hint allows reordering" behaviour? (and so that the 1 RUN line covers all tests?) If that's the case, can you do a NFC patch where you first change the test to use the new VF, and then rebase this patch on top?
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
1–7	The first test, `@fadd_strict`, has check lines that match the first RUN line, and the 4th RUN line, but none of the others. Why is that? And are all these RUN lines needed?
396–397	nit: ; Strict reduction could be performed in-loop, but ordered FP induction variables are not supported.
420	nit: this PHI is unnecessary? (same for the tests below)
424	Can you be more explicit in the comment? i.e. ; As above, but with the FP induction being unordered (fast), the loop can be vectorized.

kmclaughlin added inline comments. May 18 2021, 6:40 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
76	Hi @sdesmalen, it is the case that I changed the VF so that the test could be covered by one RUN line, and to try and circumvent the hints allow reordering behaviour. I will move changes to the RUN lines into a new patch so that this patch only adds new tests.
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
1–7	Since the allowReordering() function returns false if `EC.getKnownMinValue() > 1`, I thought it was worth making sure that we don't vectorize a VF of 1 for at least one of the tests, which is why I added the extra RUN line to `@fadd_strict`. The RUN lines are needed so that we can pass the different VFs & interleave counts needed for each of the tests (e.g. `@fadd_strict_unroll` needs a UF > 1) and I didn't want to change the 'allow reordering' behaviour by passing hints through metadata. Though I think I could remove the `CHECK-PRED` line since the `@fadd_predicated` does rely on metadata if this would help at all?

sdesmalen added inline comments. May 19 2021, 1:35 AM

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
76	I think it's worth adding a flag to the vectorizer to disable this weird behaviour for testing purposes, so that we don't need to change this test, and so that you don't need the multiple RUN lines in the other test in favour of using metadata to control the VF per individual test.

kmclaughlin mentioned this in D102774: [LoopVectorize] Add a flag to prevent reordering of FP operations with hints. May 19 2021, 6:52 AM

Rebased on the dependent changes, D102774
Removed the separate RUN lines from both strict-fadd.ll & scalable-strict-fadd.ll for different VF/UFs.
The tests now use one RUN line with the -hints-allow-reordering=false flag. This uses the existing CHECK lines in the tests, which prior to these changes only tested vectorization where reordering was allowed.

kmclaughlin added a parent revision: D102774: [LoopVectorize] Add a flag to prevent reordering of FP operations with hints. May 19 2021, 7:18 AM

Harbormaster completed remote builds in B105233: Diff 346450. May 19 2021, 7:18 AM

As discussed with @sdesmalen, I have made the following changes to this patch so that the tests are clearer:

Combined this patch with D102774, adding the -hints-allow-reordering flag here
Added CHECK-ORDERED and CHECK-UNORDERED lines to the tests in D103015
Updated the RUN lines in this patch to use the -hints-allow-reordering flag and also added a RUN line for CHECK-NOT-VECTORIZED. The tests themselves remain unchanged, with the exception of the new tests added for induction variables.

Harbormaster completed remote builds in B105890: Diff 347364. May 24 2021, 6:18 AM

kmclaughlin edited parent revisions, added: D103015: [NFC] Add CHECK lines for unordered FP reductions; removed: D102774: [LoopVectorize] Add a flag to prevent reordering of FP operations with hints. May 24 2021, 6:20 AM

kmclaughlin edited the summary of this revision. May 24 2021, 6:26 AM

sdesmalen added inline comments. May 24 2021, 6:30 AM

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
138	I'd suggest moving the implementation of `allowReordering` to `LoopVectorizationLegality.cpp`, so that you don't need to add the extern+include above.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
866–867	nit: can you fold this into the condition below:

    if (!EnableStrictReductions || any_of(...))
      return false;
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
3	Can you also add a RUN line for `-enable-strict-reductions=true -hints-allow-reordering=true` (which I think can reuse prefix CHECK-UNORDERED)

david-arm added inline comments. May 24 2021, 7:21 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
866–867	I guess the comment below will need updating too if the check is moved.
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
474	Are these new tests missing hints that the other tests seem to use? I just wondered if it was better to be consistent here that's all. The reason I mention this is because I was expecting the UNORDERED case to vectorise due to the `-hints-allow-reordering=true` flag.

david-arm added inline comments. May 24 2021, 7:37 AM

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
474	I think I see now @kmclaughlin - you're testing the productisation of `-enable-strict-reductions` so you were adding some tests deliberately without hints, which also makes sense. In this case I'd also be happy if you left these tests as they are and just added some comments explaining why we expect the CHECK-UNORDERED case to not vectorise.

Moved the implementation of allowReordering() to LoopVectorizationLegality.cpp
Added a comment above the new tests in strict-fadd.ll to explain why the CHECK-UNORDERED case should not vectorize

Harbormaster completed remote builds in B106080: Diff 347653. May 25 2021, 5:56 AM

kmclaughlin added inline comments. May 25 2021, 5:57 AM

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
3	Hi @sdesmalen, I haven't added another RUN line for `-enable-strict-reductions=true -hints-allow-reordering=true` as this will cause failures with the fadd_multiple test. For most of the tests, the output where both flags are true will match the CHECK-ORDERED lines, since we will always use strict reductions where possible if this flag is set. For fadd_multiple, we cannot use strict reductions and so the value of `-hints-allow-reordering` will change whether or not the test vectorizes. As we discussed previously, I will follow this up with a patch which ensures we only choose strict reductions if we do not allow reordering. At this point I can add a RUN line as you've suggested and reuse the `CHECK-UNORDERED` prefix.
474	Hi @david-arm, I've added a comment above these tests to explain why the CHECK-UNORDERED case shouldn't vectorize.

Other than my request for a FIXME, I'm happy with the patch. LGTM!

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
3	Fair enough, thanks for explaining. Can you just add a FIXME above the `cl::opt` for `-enable-strict-reductions` that this flag reverses the default behaviour we have now when hints are passed?

This revision is now accepted and ready to land. May 25 2021, 2:12 PM

This revision was landed with ongoing or failed builds. May 26 2021, 6:06 AM

Closed by commit rG9f76a8526010: [LoopVectorize] Enable strict reductions when allowReordering() returns false (authored by kmclaughlin).

This revision was automatically updated to reflect the committed changes.

kmclaughlin added a commit: rG9f76a8526010: [LoopVectorize] Enable strict reductions when allowReordering() returns false.

Thank you @sdesmalen & @david-arm for the reviews and comments!

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	8 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	33 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	2 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll	221 lines
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll	467 lines

Diff 344786

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

[129 lines not shown] public:
    }

    bool isScalable() const { return Scalable.Value; }

    /// If hints are provided that force vectorization, use the AlwaysPrint
    /// pass name to force the frontend to print the diagnostic.
    const char *vectorizeAnalysisPassName() const;

    bool allowReordering() const {
      // When enabling loop hints are provided we allow the vectorizer to change
      // the order of operations that is given by the scalar loop. This is not
      // enabled by default because can be unsafe or inefficient. For example,
      // reordering floating-point operations will change the way round-off
      // error accumulates in the loop.
      ElementCount EC = getWidth();
      return getForce() == LoopVectorizeHints::FK_Enabled ||
             EC.getKnownMinValue() > 1;
    }

    bool isPotentiallyUnsafe() const {
      // Avoid FP vectorization if the target is unsure about proper support.
      // This may be related to the SIMD unit in the target not handling
      // IEEE 754 FP ops properly, or bad single-to-double promotions.
      // Otherwise, a sequence of vectorized loops, even without reduction,
      // could lead to different end results on the destination vectors.
[35 lines not shown] void addExactFPMathInst(Instruction *I) {
      if (I && !ExactFPMathInst)
        ExactFPMathInst = I;
    }

    void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; }

    Instruction *getExactFPInst() { return ExactFPMathInst; }
-   bool canVectorizeFPMath(const LoopVectorizeHints &Hints) const {
-     return !ExactFPMathInst || Hints.allowReordering();
-   }

    unsigned getNumRuntimePointerChecks() const {
      return NumRuntimePointerChecks;
    }

  private:
    unsigned NumRuntimePointerChecks = 0;
    Instruction *ExactFPMathInst = nullptr;
[41 lines not shown] public:
    /// This does not mean that it is profitable to vectorize this
    /// loop, only that it is legal to do so.
    /// Temporarily taking UseVPlanNativePath parameter. If true, take
    /// the new code path being implemented for outer loop vectorization
    /// (should be functional for inner loop vectorization) based on VPlan.
    /// If false, good old LV code.
    bool canVectorize(bool UseVPlanNativePath);

+   /// Returns true if it is legal to vectorize the FP math operations in this
+   /// loop. Vectorizing is legal if we allow reordering of FP operations, or if
+   /// we can use in-order reductions.
+   bool canVectorizeFPMath(bool EnableStrictReductions);
+
    /// Return true if we can vectorize this loop while folding its tail by
    /// masking, and mark all respective loads/stores for masking.
    /// This object's state is only modified iff this function returns true.
    bool prepareToFoldTailByMasking();

    /// Returns the primary induction variable.
    PHINode *getPrimaryInduction() { return PrimaryInduction; }

[270 lines not shown]

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

Show First 20 Lines • Show All 851 Lines • ▼ Show 20 Lines	if (LAI->hasDependenceInvolvingLoopInvariantAddress()) {
return false;		return false;
}		}
Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks());		Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks());
PSE.addPredicate(LAI->getPSE().getUnionPredicate());		PSE.addPredicate(LAI->getPSE().getUnionPredicate());

return true;		return true;
}		}

		bool LoopVectorizationLegality::canVectorizeFPMath(
		bool EnableStrictReductions) {

		// First check if there is any ExactFP math or if we allow reassociations
		if (!Requirements->getExactFPInst() \|\| Hints->allowReordering())
		return true;
		sdesmalenUnsubmitted Not Done Reply Inline Actions Should a hint of VF=1 really lead to the diagnostic `"loop not vectorized: cannot prove it is safe to reorder floating-point operations"`? sdesmalen: Should a hint of VF=1 really lead to the diagnostic `"loop not vectorized: cannot prove it is…
		kmclaughlinAuthorUnsubmitted Done Reply Inline Actions I've removed this check as I don't think it's necessary, but it was added to be consistent with allowReordering() which returns false if `EC.getKnownMinValue() > 1` kmclaughlin: I've removed this check as I don't think it's necessary, but it was added to be consistent with…

		sdesmalenUnsubmitted Not Done Reply Inline Actions This will also return false if there are no reductions at all, or if all reductions are unordered? sdesmalen: This will also return false if there are no reductions at all, or if all reductions are…
		if (!EnableStrictReductions)
		sdesmalenUnsubmitted Done Reply Inline Actions nit: can you fold this into the condition below: if (!EnableStrictReductions \|\| any_of(...)) return false; sdesmalen: nit: can you fold this into the condition below: if (!EnableStrictReductions \|\| any_of(...))…
		david-armUnsubmitted Done Reply Inline Actions I guess the comment below will need updating too if the check is moved. david-arm: I guess the comment below will need updating too if the check is moved.
		return false;

		// If the above is false, we have ExactFPMath & do not allow reordering.
		// First check if we have any Exact FP induction vars, which we cannot
		sdesmalenUnsubmitted Not Done Reply Inline Actions How about returning true if for each reduction variable, any of the following conditions is true: The reduction is no ExactFPMath instruction for the reduction. The reduction is unordered. EnableStrictReductions is true. sdesmalen: How about returning true if for each reduction variable, any of the following conditions is…
		// vectorize.
		sdesmalenUnsubmitted Not Done Reply Inline Actions I'd prefer the default case to be `return false;`, i.e. when we cannot explicitly determine it is safe, we assume it isn't safe. That would handle the case where `Requirements->getExactFPInst()` is true, but it isn't an instruction used in the reduction. (although I don't know if that would ever happen?) sdesmalen: I'd prefer the default case to be `return false;`, i.e. when we cannot explicitly determine it…
		if (any_of(getInductionVars(), [&](auto &Induction) -> bool {
		InductionDescriptor IndDesc = Induction.second;
		return IndDesc.getExactFPMathInst();
		}))
		return false;

		// We can now only vectorize if all reductions with Exact FP math also
		sdesmalenUnsubmitted Done Reply Inline Actions is it sufficient to write: if (any_of(..... { })) return false; (i.e. if `ExactIndVars` is true, then `!getInductionVars().empty()` must also be true) sdesmalen: is it sufficient to write: if (any_of(..... { })) return false; (i.e. if `ExactIndVars`…
		// have the isOrdered flag set, which indicates that we can move the
		// reduction operations in-loop.
		bool ExactRdxVars = (any_of(getReductionVars(), [&](auto &Reduction) -> bool {
		sdesmalenUnsubmitted Done Reply Inline Actions Can you rewrite this to: // We can now only vectorize if all reductions with Exact FP math also // have the isOrdered flag set, which indicates that we can move the // reduction operations in-loop. return all_of(getReductionVars(), [&](auto &Reduction) -> bool { RecurrenceDescriptor RdxDesc = Reduction.second; return !RdxDesc.hasExactFPMath() \|\| RdxDesc.isOrdered(); }); sdesmalen: Can you rewrite this to: // We can now only vectorize if all reductions with Exact FP math…
		RecurrenceDescriptor RdxDesc = Reduction.second;
		return RdxDesc.hasExactFPMath() && !RdxDesc.isOrdered();
		sdesmalen: Is it still necessary to iterate through the reduction variables at this point? Given that EnableStrictReductions is true, and that reductions are the only other operations that can have exact FPMath instructions, I think you can just return `true`.
		david-arm: We don't support vectorising all of these reductions. For example, we don't support strict reductions involving `fmul`, and we don't currently support chains of `fadds` either. That's why the code below checks `!RdxDesc.isOrdered()`: the ordered flag is only set for the cases we can support at the moment. I think @kmclaughlin has also added a test for this case below, called `fast_induction_unordered_reduction`, which has both fmul and fadd reductions in the same loop.
		}));

		if (getReductionVars().empty() || !ExactRdxVars)
		david-arm: nit: This is just a suggestion, but you could rename `ExactRdxVars` to `HasExactRdxVar` and then here simply do `return !HasExactRdxVar;`, since I think that variable should be false when the list is empty?
		return true;

		return false;
		}
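
The `any_of` form in the diff and the `all_of` form suggested in the inline comments are related by De Morgan's law. The following is a minimal standalone sketch of that equivalence, not the code that landed: `Rdx` is a hypothetical stand-in for `RecurrenceDescriptor`, and `std::any_of`/`std::all_of` are used in place of LLVM's range helpers. It also shows that both forms accept an empty reduction list, so no separate `empty()` check is needed.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for RecurrenceDescriptor; only the two queries
// discussed in the review are modelled.
struct Rdx {
  bool ExactFP;
  bool Ordered;
  bool hasExactFPMath() const { return ExactFP; }
  bool isOrdered() const { return Ordered; }
};

// Form used in the diff: reject the loop if any exact-FP reduction is
// not marked as ordered.
bool canVectorizeAnyOf(const std::vector<Rdx> &Rdxs) {
  bool HasUnorderedExactRdx =
      std::any_of(Rdxs.begin(), Rdxs.end(), [](const Rdx &R) {
        return R.hasExactFPMath() && !R.isOrdered();
      });
  return !HasUnorderedExactRdx;
}

// Form suggested in review: accept the loop only if every reduction is
// either not exact-FP or is ordered.
bool canVectorizeAllOf(const std::vector<Rdx> &Rdxs) {
  return std::all_of(Rdxs.begin(), Rdxs.end(), [](const Rdx &R) {
    return !R.hasExactFPMath() || R.isOrdered();
  });
}
```

A loop with an integer reduction (`ExactFP == false`) and an ordered FP reduction passes both checks, which matches the mixed integer/FP case raised elsewhere in this review.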

bool LoopVectorizationLegality::isInductionPhi(const Value *V) {		bool LoopVectorizationLegality::isInductionPhi(const Value *V) {
Value *In0 = const_cast<Value *>(V);		Value *In0 = const_cast<Value *>(V);
PHINode *PN = dyn_cast_or_null<PHINode>(In0);		PHINode *PN = dyn_cast_or_null<PHINode>(In0);
if (!PN)		if (!PN)
return false;		return false;

return Inductions.count(PN);		return Inductions.count(PN);
}		}

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


while (!Worklist.empty()) {
for (Use &Op : I->operands())		for (Use &Op : I->operands())
if (auto *OpI = dyn_cast<Instruction>(Op))		if (auto *OpI = dyn_cast<Instruction>(Op))
Worklist.push_back(OpI);		Worklist.push_back(OpI);
}		}
}		}

LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)		LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)
: InterleaveOnlyWhenForced(Opts.InterleaveOnlyWhenForced ||		: InterleaveOnlyWhenForced(Opts.InterleaveOnlyWhenForced ||
!EnableLoopInterleaving),		!EnableLoopInterleaving),
VectorizeOnlyWhenForced(Opts.VectorizeOnlyWhenForced ||		VectorizeOnlyWhenForced(Opts.VectorizeOnlyWhenForced ||
		sdesmalen: nit: indentation, please use clang-format.
!EnableLoopVectorization) {}		!EnableLoopVectorization) {}

bool LoopVectorizePass::processLoop(Loop *L) {		bool LoopVectorizePass::processLoop(Loop *L) {
assert((EnableVPlanNativePath || L->isInnermost()) &&		assert((EnableVPlanNativePath || L->isInnermost()) &&
"VPlan-native path is not enabled. Only process inner loops.");		"VPlan-native path is not enabled. Only process inner loops.");
		sdesmalen: Why do all of the reductions have to be ordered for the LV to be able to vectorize FP math? (e.g. if there is an integer reduction and an ordered FP reduction, it would now choose not to vectorize based on this condition)
		david-arm: I guess it might be worth adding a test for this too then, i.e. a loop with both an integer and an FP reduction, to ensure we vectorise with ordered reductions.
		kmclaughlin (author): Hi @sdesmalen, we should only need the FP reductions in the loop to be ordered. I've changed this so that only reductions where `hasExactFPMath()` is true need to be ordered, and added a test for this scenario to strict-fadd.ll.

#ifndef NDEBUG		#ifndef NDEBUG
const std::string DebugLocStr = getDebugLocString(L);		const std::string DebugLocStr = getDebugLocString(L);
#endif /* NDEBUG */		#endif /* NDEBUG */

LLVM_DEBUG(dbgs() << "\nLV: Checking a loop in \""		LLVM_DEBUG(dbgs() << "\nLV: Checking a loop in \""
<< L->getHeader()->getParent()->getName() << "\" from "		<< L->getHeader()->getParent()->getName() << "\" from "
<< DebugLocStr << "\n");		<< DebugLocStr << "\n");
if (Hints.isPotentiallyUnsafe() &&
reportVectorizationFailure(		reportVectorizationFailure(
"Potentially unsafe FP op prevents vectorization",		"Potentially unsafe FP op prevents vectorization",
"loop not vectorized due to unsafe FP support.",		"loop not vectorized due to unsafe FP support.",
"UnsafeFP", ORE, L);		"UnsafeFP", ORE, L);
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

if (!Requirements.canVectorizeFPMath(Hints)) {		if (!LVL.canVectorizeFPMath(EnableStrictReductions)) {
		sdesmalen: You don't need to pass EnableStrictReductions, since it is defined in the same file?
		david-arm: I think it has to be passed, since `EnableStrictReductions` lives in LoopVectorize.cpp and canVectorizeFPMath lives in LoopVectorizationLegality.cpp.
		sdesmalen: You're right, I didn't realise that, thanks!
ORE->emit([&]() {		ORE->emit([&]() {
		sdesmalen: This condition is a bit odd. Should `canVectorizeOrderedFPMath` just contain the call to `Requirements.canVectorizeFPMath` instead? i.e. in order to vectorize ordered FP math, it must at least be able to vectorize FP math.
		david-arm: I think the existing canVectorizeFPMath function is badly named because it actually checks for reordering:
		  bool canVectorizeFPMath(const LoopVectorizeHints &Hints) const {
		    return !ExactFPMathInst || Hints.allowReordering();
		  }
		so the logic in Kerry's patch is something like this:
		  Is this an exact FP math instruction? If not -> vectorise, else
		  Do hints permit reordering? If so -> vectorise, else
		  Can we vectorise with ordered reductions? If not -> emit remark.
		It is probably possible to combine these into a single LoopVectorizationLegality::canVectorizeFPMath function that does all of the above, since I think that class does have access to the Requirements.
auto *ExactFPMathInst = Requirements.getExactFPInst();		auto *ExactFPMathInst = Requirements.getExactFPInst();
return OptimizationRemarkAnalysisFPCommute(DEBUG_TYPE, "CantReorderFPOps",		return OptimizationRemarkAnalysisFPCommute(DEBUG_TYPE, "CantReorderFPOps",
ExactFPMathInst->getDebugLoc(),		ExactFPMathInst->getDebugLoc(),
ExactFPMathInst->getParent())		ExactFPMathInst->getParent())
<< "loop not vectorized: cannot prove it is safe to reorder "		<< "loop not vectorized: cannot prove it is safe to reorder "
"floating-point operations";		"floating-point operations";
});		});
LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "		LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "
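
The decision order described in the inline comments on this hunk (exact FP math present? hints permit reordering? ordered reductions possible?) can be sketched as a small standalone function. This is only an illustration under stated assumptions: `LoopFacts`, `Decision` and `decide` are hypothetical names, the four booleans stand in for queries made on the real `Requirements`, `Hints` and legality objects, and the separate check on FP induction variables is left out.

```cpp
// Hypothetical summary of the facts the vectorizer consults; each field
// stands in for a query on the real LLVM objects.
struct LoopFacts {
  bool HasExactFPMath;       // loop has an FP op whose order must be kept
  bool HintsAllowReordering; // stand-in for Hints.allowReordering()
  bool StrictReductionsOn;   // stand-in for -enable-strict-reductions
  bool AllExactRdxOrdered;   // every exact-FP reduction is marked ordered
};

enum class Decision { Vectorize, VectorizeOrdered, EmitRemark };

// Mirrors the three questions from the review comment, in order.
Decision decide(const LoopFacts &F) {
  if (!F.HasExactFPMath)
    return Decision::Vectorize; // nothing constrains the FP order
  if (F.HintsAllowReordering)
    return Decision::Vectorize; // the user opted in to reordering
  if (F.StrictReductionsOn && F.AllExactRdxOrdered)
    return Decision::VectorizeOrdered; // keep order via in-loop reductions
  return Decision::EmitRemark; // cannot prove reordering is safe
}
```

Folding all three questions into one function is the consolidation suggested in the comments: the caller then only needs a single yes/no answer before emitting the `CantReorderFPOps` remark.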

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

	; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -enable-strict-reductions -S | FileCheck %s -check-prefix=CHECK			; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -force-vector-width=8 -force-vector-interleave=1 -enable-strict-reductions -S | FileCheck %s -check-prefix=CHECK-VF8UF1
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -force-vector-width=8 -force-vector-interleave=4 -enable-strict-reductions -S | FileCheck %s -check-prefix=CHECK-VF8UF4
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -force-vector-width=4 -force-vector-interleave=1 -enable-strict-reductions -S | FileCheck %s -check-prefix=CHECK-VF4UF1
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -enable-strict-reductions -force-vector-width=1 -force-vector-interleave=1 -S | FileCheck %s -check-prefix=CHECK-VF1UF1

	define float @fadd_strict(float* noalias nocapture readonly %a, i64 %n) {			define float @fadd_strict(float* noalias nocapture readonly %a, i64 %n) {
	; CHECK-LABEL: @fadd_strict			; CHECK-VF8UF1-LABEL: @fadd_strict
	; CHECK: vector.body:			; CHECK-VF8UF1: vector.body:
	; CHECK: %[[VEC_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]			; CHECK-VF8UF1: %[[VEC_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]
	; CHECK: %[[LOAD:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*			; CHECK-VF8UF1: %[[LOAD:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[VEC_PHI]], <vscale x 8 x float> %[[LOAD]])			; CHECK-VF8UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[VEC_PHI]], <vscale x 8 x float> %[[LOAD]])
	; CHECK: for.end			; CHECK-VF8UF1: for.end
	; CHECK: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF8UF1: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF8UF1: ret float %[[PHI]]

				; CHECK-VF1UF1: vector.body
				david-arm: Thanks a lot for adding the VF=vscale x 1 case here, but perhaps `CHECK-SCALAR` should be `CHECK-VF1U1`, since we're still vectorising? Also, it's probably worth adding an extra CHECK line for at least one instruction that shows the "<vscale x 1 ..." type, maybe the `call float ...` instruction?
				; CHECK-VF1UF1: call float @llvm.vector.reduce.fadd.nxv1f32(float {{.*}}, <vscale x 1 x float> {{.*}})
				sdesmalen: Should all test functions have check lines for all of VF8UF1, VF8UF4, etc.? Conversely, is it sufficient to just pass the interleave-count hint (not the vector width) via metadata and have one RUN line for VF8UF1, VF8UF4 and VF4UF1? Which also makes me wonder, what is the additional value of having both VF8UF1 and VF4UF1?
				kmclaughlin (author): I think it should be sufficient to pass the interleave count via metadata. I've changed VF4UF1 to VF8UF1 as there was no additional benefit in having both; similarly, I've changed VF4UF1 in the strict-fadd.ll test as well to reduce the number of RUN lines.
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]			%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %0, %sum.07			%add = fadd float %0, %sum.07
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end:			for.end:
	ret float %add			ret float %add
	}			}

	define float @fadd_strict_unroll(float* noalias nocapture readonly %a, i64 %n) {			define float @fadd_strict_unroll(float* noalias nocapture readonly %a, i64 %n) {
	; CHECK-LABEL: @fadd_strict_unroll			; CHECK-VF8UF4-LABEL: @fadd_strict_unroll
	; CHECK: vector.body:			; CHECK-VF8UF4: vector.body:
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]			; CHECK-VF8UF4: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]
	; CHECK-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]			; CHECK-VF8UF4-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]
	; CHECK: %[[LOAD1:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*			; CHECK-VF8UF4: %[[LOAD1:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
	; CHECK: %[[LOAD2:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*			; CHECK-VF8UF4: %[[LOAD2:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
	; CHECK: %[[LOAD3:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*			; CHECK-VF8UF4: %[[LOAD3:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
	; CHECK: %[[LOAD4:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*			; CHECK-VF8UF4: %[[LOAD4:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>*
	; CHECK: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[VEC_PHI1]], <vscale x 8 x float> %[[LOAD1]])			; CHECK-VF8UF4: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[VEC_PHI1]], <vscale x 8 x float> %[[LOAD1]])
	; CHECK: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX1]], <vscale x 8 x float> %[[LOAD2]])			; CHECK-VF8UF4: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX1]], <vscale x 8 x float> %[[LOAD2]])
	; CHECK: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX2]], <vscale x 8 x float> %[[LOAD3]])			; CHECK-VF8UF4: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX2]], <vscale x 8 x float> %[[LOAD3]])
	; CHECK: %[[RDX4]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX3]], <vscale x 8 x float> %[[LOAD4]])			; CHECK-VF8UF4: %[[RDX4]] = call float @llvm.vector.reduce.fadd.nxv8f32(float %[[RDX3]], <vscale x 8 x float> %[[LOAD4]])
	; CHECK: for.end			; CHECK-VF8UF4: for.end
	; CHECK: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX4]], %middle.block ]			; CHECK-VF8UF4: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX4]], %middle.block ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF8UF4: ret float %[[PHI]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]			%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %0, %sum.07			%add = fadd float %0, %sum.07
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end:			for.end:
	ret float %add			ret float %add
	}			}
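
The CHECK lines for `fadd_strict_unroll` expect each interleaved part's `llvm.vector.reduce.fadd` to take the previous part's result as its start value (`RDX1` feeds `RDX2`, and so on), which keeps the additions in the original scalar order. A minimal standalone C++ model of that chaining follows. It is a sketch, not vectorizer code: all function names are hypothetical, fixed `VF`/`UF` values replace scalable vectors, and `reorderedSum` models the usual one-partial-sum-per-lane reduction for contrast.

```cpp
#include <cstddef>
#include <vector>

// Models an ordered (strict) vector.reduce.fadd: the lanes are added to
// the start value one at a time, left to right.
float orderedReduce(float Start, const float *Lanes, std::size_t VF) {
  float Acc = Start;
  for (std::size_t I = 0; I < VF; ++I)
    Acc += Lanes[I];
  return Acc;
}

// The original scalar loop.
float scalarSum(const std::vector<float> &A) {
  float Sum = 0.0f;
  for (float X : A)
    Sum += X;
  return Sum;
}

// Strict reduction with interleaving: each part's reduction uses the
// previous part's result as its start value.
float strictSum(const std::vector<float> &A, std::size_t VF, std::size_t UF) {
  float Acc = 0.0f;
  std::size_t I = 0;
  const std::size_t Step = VF * UF;
  for (; I + Step <= A.size(); I += Step)
    for (std::size_t Part = 0; Part < UF; ++Part)
      Acc = orderedReduce(Acc, &A[I + Part * VF], VF);
  for (; I < A.size(); ++I) // scalar remainder loop
    Acc += A[I];
  return Acc;
}

// Reordered reduction: one partial sum per lane, combined after the loop.
float reorderedSum(const std::vector<float> &A, std::size_t VF) {
  std::vector<float> Partial(VF, 0.0f);
  std::size_t I = 0;
  for (; I + VF <= A.size(); I += VF)
    for (std::size_t L = 0; L < VF; ++L)
      Partial[L] += A[I + L];
  float Sum = 0.0f;
  for (float P : Partial)
    Sum += P;
  for (; I < A.size(); ++I)
    Sum += A[I];
  return Sum;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` with `VF = 2`, `strictSum` performs the same sequence of additions as `scalarSum`, whereas `reorderedSum` adds `1e8f` and `-1e8f` in the same lane first. On targets that evaluate `float` arithmetic in single precision, the two results therefore differ, which is why reordering is only allowed when hints or fast-math flags permit it.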

	define void @fadd_strict_interleave(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define void @fadd_strict_interleave(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_strict_interleave			; CHECK-VF4UF1-LABEL: @fadd_strict_interleave
	; CHECK: entry			; CHECK-VF4UF1: entry
	; CHECK: %[[ARRAYIDX:.*]] = getelementptr inbounds float, float* %a, i64 1			; CHECK-VF4UF1: %[[ARRAYIDX:.*]] = getelementptr inbounds float, float* %a, i64 1
	; CHECK: %[[LOAD1:.*]] = load float, float* %a			; CHECK-VF4UF1: %[[LOAD1:.*]] = load float, float* %a
	; CHECK: %[[LOAD2:.*]] = load float, float* %[[ARRAYIDX]]			; CHECK-VF4UF1: %[[LOAD2:.*]] = load float, float* %[[ARRAYIDX]]
	; CHECK: vector.ph			; CHECK-VF4UF1: vector.ph
	; CHECK: %[[STEPVEC1:.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()			; CHECK-VF4UF1: %[[STEPVEC1:.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
	; CHECK: %[[STEPVEC_ADD1:.*]] = add <vscale x 4 x i64> %[[STEPVEC1]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-VF4UF1: %[[STEPVEC_ADD1:.*]] = add <vscale x 4 x i64> %[[STEPVEC1]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				sdesmalen: Why does this test need to be vectorized with VF=vscale x 8 instead of VF=vscale x 4? Is that because it needs to be driven using the cmdline flags to circumvent the "hint allows reordering" behaviour (and so that the one RUN line covers all tests)? If that's the case, can you do an NFC patch where you first change the test to use the new VF, and then rebase this patch on top?
				kmclaughlin (author): Hi @sdesmalen, yes, I changed the VF so that the test could be covered by one RUN line, and to try to circumvent the "hints allow reordering" behaviour. I will move the changes to the RUN lines into a new patch so that this patch only adds new tests.
				sdesmalen: I think it's worth adding a flag to the vectorizer to disable this weird behaviour for testing purposes, so that we don't need to change this test, and so that the other test can use metadata to control the VF per individual test instead of needing multiple RUN lines.
	; CHECK: %[[STEPVEC_MUL:.*]] = mul <vscale x 4 x i64> %[[STEPVEC_ADD1]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 2, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-VF4UF1: %[[STEPVEC_MUL:.*]] = mul <vscale x 4 x i64> %[[STEPVEC_ADD1]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 2, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK: %[[INDUCTION:.*]] = add <vscale x 4 x i64> shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer), %[[STEPVEC_MUL]]			; CHECK-VF4UF1: %[[INDUCTION:.*]] = add <vscale x 4 x i64> shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer), %[[STEPVEC_MUL]]
	; CHECK: vector.body			; CHECK-VF4UF1: vector.body
	; CHECK: %[[VEC_PHI2:.*]] = phi float [ %[[LOAD2]], %vector.ph ], [ %[[RDX2:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI2:.*]] = phi float [ %[[LOAD2]], %vector.ph ], [ %[[RDX2:.*]], %vector.body ]
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ %[[LOAD1]], %vector.ph ], [ %[[RDX1:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI1:.*]] = phi float [ %[[LOAD1]], %vector.ph ], [ %[[RDX1:.*]], %vector.body ]
	; CHECK: %[[VEC_IND:.*]] = phi <vscale x 4 x i64> [ %[[INDUCTION]], %vector.ph ], [ {{.*}}, %vector.body ]			; CHECK-VF4UF1: %[[VEC_IND:.*]] = phi <vscale x 4 x i64> [ %[[INDUCTION]], %vector.ph ], [ {{.*}}, %vector.body ]
	; CHECK: %[[GEP1:.*]] = getelementptr inbounds float, float* %b, <vscale x 4 x i64> %[[VEC_IND]]			; CHECK-VF4UF1: %[[GEP1:.*]] = getelementptr inbounds float, float* %b, <vscale x 4 x i64> %[[VEC_IND]]
	; CHECK: %[[MGATHER1:.*]] = call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*> %[[GEP1]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> undef)			; CHECK-VF4UF1: %[[MGATHER1:.*]] = call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*> %[[GEP1]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> undef)
	; CHECK: %[[RDX1]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI1]], <vscale x 4 x float> %[[MGATHER1]])			; CHECK-VF4UF1: %[[RDX1]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI1]], <vscale x 4 x float> %[[MGATHER1]])
	; CHECK: %[[OR:.*]] = or <vscale x 4 x i64> %[[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-VF4UF1: %[[OR:.*]] = or <vscale x 4 x i64> %[[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK: %[[GEP2:.*]] = getelementptr inbounds float, float* %b, <vscale x 4 x i64> %[[OR]]			; CHECK-VF4UF1: %[[GEP2:.*]] = getelementptr inbounds float, float* %b, <vscale x 4 x i64> %[[OR]]
	; CHECK: %[[MGATHER2:.*]] = call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*> %[[GEP2]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> undef)			; CHECK-VF4UF1: %[[MGATHER2:.*]] = call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*> %[[GEP2]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> undef)
	; CHECK: %[[RDX2]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI2]], <vscale x 4 x float> %[[MGATHER2]])			; CHECK-VF4UF1: %[[RDX2]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI2]], <vscale x 4 x float> %[[MGATHER2]])
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK ret void			; CHECK-VF4UF1: ret void
	entry:			entry:
	%arrayidxa = getelementptr inbounds float, float* %a, i64 1			%arrayidxa = getelementptr inbounds float, float* %a, i64 1
	%a1 = load float, float* %a, align 4			%a1 = load float, float* %a, align 4
	%a2 = load float, float* %arrayidxa, align 4			%a2 = load float, float* %arrayidxa, align 4
	br label %for.body			br label %for.body

	for.body:			for.body:
	%add.phi1 = phi float [ %a2, %entry ], [ %add2, %for.body ]			%add.phi1 = phi float [ %a2, %entry ], [ %add2, %for.body ]
	%add.phi2 = phi float [ %a1, %entry ], [ %add1, %for.body ]			%add.phi2 = phi float [ %a1, %entry ], [ %add1, %for.body ]
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%arrayidxb1 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidxb1 = getelementptr inbounds float, float* %b, i64 %iv
	%0 = load float, float* %arrayidxb1, align 4			%0 = load float, float* %arrayidxb1, align 4
	%add1 = fadd float %0, %add.phi2			%add1 = fadd float %0, %add.phi2
	%or = or i64 %iv, 1			%or = or i64 %iv, 1
	%arrayidxb2 = getelementptr inbounds float, float* %b, i64 %or			%arrayidxb2 = getelementptr inbounds float, float* %b, i64 %or
	%1 = load float, float* %arrayidxb2, align 4			%1 = load float, float* %arrayidxb2, align 4
	%add2 = fadd float %1, %add.phi1			%add2 = fadd float %1, %add.phi1
	%iv.next = add nuw nsw i64 %iv, 2			%iv.next = add nuw nsw i64 %iv, 2
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end:			for.end:
	store float %add1, float* %a, align 4			store float %add1, float* %a, align 4
	store float %add2, float* %arrayidxa, align 4			store float %add2, float* %arrayidxa, align 4
	ret void			ret void
	}			}

	define float @fadd_invariant(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define float @fadd_invariant(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_invariant			; CHECK-VF4UF1-LABEL: @fadd_invariant
	; CHECK: vector.body			; CHECK-VF4UF1: vector.body
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]
	; CHECK: %[[LOAD1:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*			; CHECK-VF4UF1: %[[LOAD1:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*
	; CHECK: %[[LOAD2:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*			; CHECK-VF4UF1: %[[LOAD2:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*
	; CHECK: %[[ADD:.*]] = fadd <vscale x 4 x float> %[[LOAD1]], %[[LOAD2]]			; CHECK-VF4UF1: %[[ADD:.*]] = fadd <vscale x 4 x float> %[[LOAD1]], %[[LOAD2]]
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI1]], <vscale x 4 x float> %[[ADD]])			; CHECK-VF4UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI1]], <vscale x 4 x float> %[[ADD]])
	; CHECK: for.end.loopexit			; CHECK-VF4UF1: for.end.loopexit
	; CHECK: %[[EXIT_PHI:.*]] = phi float [ {{.*}}, %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF4UF1: %[[EXIT_PHI:.*]] = phi float [ {{.*}}, %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK: %[[PHI:.*]] = phi float [ 0.000000e+00, %entry ], [ %[[EXIT_PHI]], %for.end.loopexit ]			; CHECK-VF4UF1: %[[PHI:.*]] = phi float [ 0.000000e+00, %entry ], [ %[[EXIT_PHI]], %for.end.loopexit ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF4UF1: ret float %[[PHI]]
	entry:			entry:
	%arrayidx = getelementptr inbounds float, float* %a, i64 1			%arrayidx = getelementptr inbounds float, float* %a, i64 1
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%cmp1 = fcmp ogt float %0, 5.000000e-01			%cmp1 = fcmp ogt float %0, 5.000000e-01
	br i1 %cmp1, label %for.body, label %for.end			br i1 %cmp1, label %for.body, label %for.end

	for.body: ; preds = %for.body			for.body: ; preds = %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%res.014 = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]			%res.014 = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]
	%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	%arrayidx4 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx4 = getelementptr inbounds float, float* %b, i64 %iv
	%2 = load float, float* %arrayidx4, align 4			%2 = load float, float* %arrayidx4, align 4
	%add = fadd float %1, %2			%add = fadd float %1, %2
	%rdx = fadd float %res.014, %add			%rdx = fadd float %res.014, %add
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end: ; preds = %for.body, %entry			for.end: ; preds = %for.body, %entry
	%res = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]			%res = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]
	ret float %res			ret float %res
	}			}

	define float @fadd_conditional(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define float @fadd_conditional(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_conditional			; CHECK-VF4UF1-LABEL: @fadd_conditional
	; CHECK: vector.body			; CHECK-VF4UF1: vector.body
	; CHECK: %[[VEC_PHI:.*]] = phi float [ 1.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI:.*]] = phi float [ 1.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]
	; CHECK: %[[LOAD:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*			; CHECK-VF4UF1: %[[LOAD:.*]] = load <vscale x 4 x float>, <vscale x 4 x float>*
	; CHECK: %[[FCMP:.*]] = fcmp une <vscale x 4 x float> %[[LOAD]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)			; CHECK-VF4UF1: %[[FCMP:.*]] = fcmp une <vscale x 4 x float> %[[LOAD]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
	; CHECK: %[[MASKED_LOAD:.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* {{.*}}, i32 4, <vscale x 4 x i1> %[[FCMP]], <vscale x 4 x float> poison)			; CHECK-VF4UF1: %[[MASKED_LOAD:.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* {{.*}}, i32 4, <vscale x 4 x i1> %[[FCMP]], <vscale x 4 x float> poison)
	; CHECK: %[[XOR:.*]] = xor <vscale x 4 x i1> %[[FCMP]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer)			; CHECK-VF4UF1: %[[XOR:.*]] = xor <vscale x 4 x i1> %[[FCMP]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> undef, i1 true, i32 0), <vscale x 4 x i1> undef, <vscale x 4 x i32> zeroinitializer)
	; CHECK: %[[SELECT:.*]] = select <vscale x 4 x i1> %[[XOR]], <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 3.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> %[[MASKED_LOAD]]			; CHECK-VF4UF1: %[[SELECT:.*]] = select <vscale x 4 x i1> %[[XOR]], <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 3.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x float> %[[MASKED_LOAD]]
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI]], <vscale x 4 x float> %[[SELECT]])			; CHECK-VF4UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.nxv4f32(float %[[VEC_PHI]], <vscale x 4 x float> %[[SELECT]])
	; CHECK: scalar.ph			; CHECK-VF4UF1: scalar.ph
	; CHECK: %[[MERGE_RDX:.*]] = phi float [ 1.000000e+00, %entry ], [ %[[RDX]], %middle.block ]			; CHECK-VF4UF1: %[[MERGE_RDX:.*]] = phi float [ 1.000000e+00, %entry ], [ %[[RDX]], %middle.block ]
	; CHECK: for.body			; CHECK-VF4UF1: for.body
	; CHECK: %[[RES:.*]] = phi float [ %[[MERGE_RDX]], %scalar.ph ], [ %[[FADD:.*]], %for.inc ]			; CHECK-VF4UF1: %[[RES:.*]] = phi float [ %[[MERGE_RDX]], %scalar.ph ], [ %[[FADD:.*]], %for.inc ]
	; CHECK: if.then			; CHECK-VF4UF1: if.then
	; CHECK: %[[LOAD2:.*]] = load float, float*			; CHECK-VF4UF1: %[[LOAD2:.*]] = load float, float*
	; CHECK: for.inc			; CHECK-VF4UF1: for.inc
	; CHECK: %[[PHI:.*]] = phi float [ %[[LOAD2]], %if.then ], [ 3.000000e+00, %for.body ]			; CHECK-VF4UF1: %[[PHI:.*]] = phi float [ %[[LOAD2]], %if.then ], [ 3.000000e+00, %for.body ]
	; CHECK: %[[FADD]] = fadd float %[[RES]], %[[PHI]]			; CHECK-VF4UF1: %[[FADD]] = fadd float %[[RES]], %[[PHI]]
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK: %[[RDX_PHI:.*]] = phi float [ %[[FADD]], %for.inc ], [ %[[RDX]], %middle.block ]			; CHECK-VF4UF1: %[[RDX_PHI:.*]] = phi float [ %[[FADD]], %for.inc ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[RDX_PHI]]			; CHECK-VF4UF1: ret float %[[RDX_PHI]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body			for.body: ; preds = %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]
	%res = phi float [ 1.000000e+00, %entry ], [ %fadd, %for.inc ]			%res = phi float [ 1.000000e+00, %entry ], [ %fadd, %for.inc ]
	%arrayidx = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx = getelementptr inbounds float, float* %b, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%tobool = fcmp une float %0, 0.000000e+00			%tobool = fcmp une float %0, 0.000000e+00
	br i1 %tobool, label %if.then, label %for.inc			br i1 %tobool, label %if.then, label %for.inc

	if.then: ; preds = %for.body			if.then: ; preds = %for.body
	%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	br label %for.inc			br label %for.inc

	for.inc:			for.inc:
	%phi = phi float [ %1, %if.then ], [ 3.000000e+00, %for.body ]			%phi = phi float [ %1, %if.then ], [ 3.000000e+00, %for.body ]
	%fadd = fadd float %res, %phi			%fadd = fadd float %res, %phi
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end:			for.end:
	%rdx = phi float [ %fadd, %for.inc ]			%rdx = phi float [ %fadd, %for.inc ]
	ret float %rdx			ret float %rdx
	}			}

	; Negative test - loop contains multiple fadds which we cannot safely reorder			; Negative test - loop contains multiple fadds which we cannot safely reorder
	; Note: This test vectorizes the loop with a non-strict implementation, which reorders the FAdd operations.			; Note: This test vectorizes the loop with a non-strict implementation, which reorders the FAdd operations.
	; This is happening because we are using hints, where allowReordering returns true.			; This is happening because we are using hints, where allowReordering returns true.
	define float @fadd_multiple(float* noalias nocapture %a, float* noalias nocapture %b, i64 %n) {			define float @fadd_multiple(float* noalias nocapture %a, float* noalias nocapture %b, i64 %n) {
	; CHECK-LABEL: @fadd_multiple			; CHECK-VF8UF1-LABEL: @fadd_multiple
	; CHECK: vector.body			; CHECK-VF8UF1: vector.body
	; CHECK: %[[PHI:.*]] = phi <vscale x 8 x float> [ insertelement (<vscale x 8 x float> shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> undef, float -0.000000e+00, i32 0), <vscale x 8 x float> undef, <vscale x 8 x i32> zeroinitializer), float -0.000000e+00, i32 0), %vector.ph ], [ %[[VEC_FADD2:.*]], %vector.body ]			; CHECK-VF8UF1: %[[PHI:.*]] = phi <vscale x 8 x float> [ insertelement (<vscale x 8 x float> shufflevector (<vscale x 8 x float> insertelement (<vscale x 8 x float> undef, float -0.000000e+00, i32 0), <vscale x 8 x float> undef, <vscale x 8 x i32> zeroinitializer), float -0.000000e+00, i32 0), %vector.ph ], [ %[[VEC_FADD2:.*]], %vector.body ]
	; CHECK: %[[VEC_LOAD1:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>			; CHECK-VF8UF1: %[[VEC_LOAD1:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>
	; CHECK: %[[VEC_FADD1:.*]] = fadd <vscale x 8 x float> %[[PHI]], %[[VEC_LOAD1]]			; CHECK-VF8UF1: %[[VEC_FADD1:.*]] = fadd <vscale x 8 x float> %[[PHI]], %[[VEC_LOAD1]]
	; CHECK: %[[VEC_LOAD2:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>			; CHECK-VF8UF1: %[[VEC_LOAD2:.*]] = load <vscale x 8 x float>, <vscale x 8 x float>
	; CHECK: %[[VEC_FADD2]] = fadd <vscale x 8 x float> %[[VEC_FADD1]], %[[VEC_LOAD2]]			; CHECK-VF8UF1: %[[VEC_FADD2]] = fadd <vscale x 8 x float> %[[VEC_FADD1]], %[[VEC_LOAD2]]
	; CHECK: middle.block			; CHECK-VF8UF1: middle.block
	; CHECK: %[[RDX:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float -0.000000e+00, <vscale x 8 x float> %[[VEC_FADD2]])			; CHECK-VF8UF1: %[[RDX:.*]] = call float @llvm.vector.reduce.fadd.nxv8f32(float -0.000000e+00, <vscale x 8 x float> %[[VEC_FADD2]])
	; CHECK: for.body			; CHECK-VF8UF1: for.body
	; CHECK: %[[SUM:.*]] = phi float [ %bc.merge.rdx, %scalar.ph ], [ %[[FADD2:.*]], %for.body ]			; CHECK-VF8UF1: %[[SUM:.*]] = phi float [ %bc.merge.rdx, %scalar.ph ], [ %[[FADD2:.*]], %for.body ]
	; CHECK: %[[LOAD1:.*]] = load float, float*			; CHECK-VF8UF1: %[[LOAD1:.*]] = load float, float*
	; CHECK: %[[FADD1:.*]] = fadd float %[[SUM]], %[[LOAD1]]			; CHECK-VF8UF1: %[[FADD1:.*]] = fadd float %[[SUM]], %[[LOAD1]]
	; CHECK: %[[LOAD2:.*]] = load float, float*			; CHECK-VF8UF1: %[[LOAD2:.*]] = load float, float*
	; CHECK: %[[FADD2]] = fadd float %[[FADD1]], %[[LOAD2]]			; CHECK-VF8UF1: %[[FADD2]] = fadd float %[[FADD1]], %[[LOAD2]]
	; CHECK: for.end			; CHECK-VF8UF1: for.end
	; CHECK: %[[RET:.*]] = phi float [ %[[FADD2]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF8UF1: %[[RET:.*]] = phi float [ %[[FADD2]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[RET]]			; CHECK-VF8UF1: ret float %[[RET]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum = phi float [ -0.000000e+00, %entry ], [ %add3, %for.body ]			%sum = phi float [ -0.000000e+00, %entry ], [ %add3, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %sum, %0			%add = fadd float %sum, %0
	%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	%add3 = fadd float %add, %1			%add3 = fadd float %add, %1
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0			br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

	for.end: ; preds = %for.body			for.end: ; preds = %for.body
	%rdx = phi float [ %add3, %for.body ]			%rdx = phi float [ %add3, %for.body ]
	ret float %rdx			ret float %rdx
	}			}

	!0 = distinct !{!0, !3, !6, !8}			!0 = distinct !{!0, !1}
	!1 = distinct !{!1, !3, !7, !8}			!1 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
	!2 = distinct !{!2, !4, !6, !8}
	!3 = !{!"llvm.loop.vectorize.width", i32 8}
	!4 = !{!"llvm.loop.vectorize.width", i32 4}
	!5 = !{!"llvm.loop.vectorize.width", i32 2}
	!6 = !{!"llvm.loop.interleave.count", i32 1}
	!7 = !{!"llvm.loop.interleave.count", i32 4}
	!8 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
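
Background on the "reorders the FAdd operations" note for @fadd_multiple: floating-point addition is not associative, so a reduction that keeps per-lane partial sums and combines them at the end can return a different value from the scalar loop. Without the `reassoc` flag, `llvm.vector.reduce.fadd` is an ordered reduction, which is what the strict tests check for. The following is only an illustrative C++ sketch and is not part of this patch; the function names `sum_in_order` and `sum_reordered` are hypothetical, and the 4-lane layout is an assumption chosen for brevity.

```cpp
#include <cstddef>
#include <vector>

// Strict (in-order) reduction: same association as the scalar loop.
// Hypothetical helper, for illustration only.
float sum_in_order(const std::vector<float> &v, float init) {
  float acc = init;
  for (float x : v)
    acc += x;
  return acc;
}

// Reordered reduction: four independent partial sums that are combined
// after the loop, modelling a 4-lane vector accumulator (an assumption
// made for this sketch, not the vectorizer's actual code).
float sum_reordered(const std::vector<float> &v, float init) {
  float lanes[4] = {init, 0.0f, 0.0f, 0.0f};
  for (std::size_t i = 0; i < v.size(); ++i)
    lanes[i % 4] += v[i];
  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f} with a start value of 0.0f, the in-order sum is 1.0f, whereas the four-lane version loses both small terms to rounding when single-precision arithmetic is used without fast-math. This is the kind of difference that makes reordering unsafe by default.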

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll

	; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -S | FileCheck %s -check-prefix=CHECK			; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -force-vector-width=8 -force-vector-interleave=1 -S | FileCheck %s -check-prefix=CHECK-VF8UF1
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -force-vector-width=8 -force-vector-interleave=4 -S | FileCheck %s -check-prefix=CHECK-VF8UF4
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -force-vector-width=4 -force-vector-interleave=1 -S | FileCheck %s -check-prefix=CHECK-VF4UF1
				sdesmalen: Can you also add a RUN line for `-enable-strict-reductions=true -hints-allow-reordering=true` (which I think can reuse prefix CHECK-UNORDERED)?
				kmclaughlin (Author): Hi @sdesmalen, I haven't added another RUN line for `-enable-strict-reductions=true -hints-allow-reordering=true` as this will cause failures with the fadd_multiple test. For most of the tests, the output where both flags are true will match the CHECK-ORDERED lines, since we will always use strict reductions where possible if this flag is set. For fadd_multiple, we cannot use strict reductions and so the value of `-hints-allow-reordering` will change whether or not the test vectorizes. As we discussed previously, I will follow this up with a patch which ensures we only choose strict reductions if we do not allow reordering. At this point I can add a RUN line as you've suggested and reuse the `CHECK-UNORDERED` prefix.
				sdesmalen: Fair enough, thanks for explaining. Can you just add a FIXME above the `cl::opt` for `-enable-strict-reductions` that this flag reverses the default behaviour we have now when hints are passed?
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -force-vector-width=2 -force-vector-interleave=1 -S | FileCheck %s -check-prefix=CHECK-PRED
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -force-vector-width=1 -force-vector-interleave=1 -S | FileCheck %s -check-prefix=CHECK-SCALAR
				; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -enable-strict-reductions -pass-remarks=loop-vectorize -pass-remarks-analysis=loop-vectorize -S 2>%t | FileCheck %s -check-prefix=CHECK-NO-HINTS
				; RUN: cat %t | FileCheck %s -check-prefix=CHECK-REMARKS
				sdesmalen: The first test, `@fadd_strict`, has check lines that match the first RUN line, and the 4th RUN line, but none of the others. Why is that? And are all these RUN lines needed?
				kmclaughlin (Author): Since the allowReordering() function returns false if `EC.getKnownMinValue() > 1`, I thought it was worth making sure that we don't vectorize a VF of 1 for at least one of the tests, which is why I added the extra RUN line to `@fadd_strict`. The RUN lines are needed so that we can pass the different VFs & interleave counts needed for each of the tests (e.g. `@fadd_strict_unroll` needs a UF > 1) and I didn't want to change the 'allow reordering' behaviour by passing hints through metadata. Though I think I could remove the `CHECK-PRED` line since the `@fadd_predicated` does rely on metadata if this would help at all?

				; CHECK-REMARKS: vectorized loop (vectorization width: 2, interleaved count: 2)
	define float @fadd_strict(float* noalias nocapture readonly %a, i64 %n) {			define float @fadd_strict(float* noalias nocapture readonly %a, i64 %n) {
	; CHECK-LABEL: @fadd_strict			; CHECK-VF8UF1-LABEL: @fadd_strict
	; CHECK: vector.body:			; CHECK-VF8UF1: vector.body:
	; CHECK: %[[VEC_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]			; CHECK-VF8UF1: %[[VEC_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]
	; CHECK: %[[LOAD:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF1: %[[LOAD:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI]], <8 x float> %[[LOAD]])			; CHECK-VF8UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI]], <8 x float> %[[LOAD]])
	; CHECK: for.end			; CHECK-VF8UF1: for.end
	; CHECK: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF8UF1: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF8UF1: ret float %[[PHI]]

				; CHECK-SCALAR-NOT: vector.body
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]			%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %0, %sum.07			%add = fadd float %0, %sum.07
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0			br i1 %exitcond.not, label %for.end, label %for.body

	for.end:			for.end:
	ret float %add			ret float %add
	}			}

				; CHECK-REMARKS: vectorized loop (vectorization width: 2, interleaved count: 2)
	define float @fadd_strict_unroll(float* noalias nocapture readonly %a, i64 %n) {			define float @fadd_strict_unroll(float* noalias nocapture readonly %a, i64 %n) {
	; CHECK-LABEL: @fadd_strict_unroll			; CHECK-VF8UF4-LABEL: @fadd_strict_unroll
	; CHECK: vector.body:			; CHECK-VF8UF4: vector.body:
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]			; CHECK-VF8UF4: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]
	; CHECK-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]			; CHECK-VF8UF4-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]
	; CHECK: %[[LOAD1:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF4: %[[LOAD1:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD2:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF4: %[[LOAD2:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD3:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF4: %[[LOAD3:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD4:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF4: %[[LOAD4:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI1]], <8 x float> %[[LOAD1]])			; CHECK-VF8UF4: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI1]], <8 x float> %[[LOAD1]])
	; CHECK: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX1]], <8 x float> %[[LOAD2]])			; CHECK-VF8UF4: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX1]], <8 x float> %[[LOAD2]])
	; CHECK: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX2]], <8 x float> %[[LOAD3]])			; CHECK-VF8UF4: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX2]], <8 x float> %[[LOAD3]])
	; CHECK: %[[RDX4]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX3]], <8 x float> %[[LOAD4]])			; CHECK-VF8UF4: %[[RDX4]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX3]], <8 x float> %[[LOAD4]])
	; CHECK: for.end			; CHECK-VF8UF4: for.end
	; CHECK: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX4]], %middle.block ]			; CHECK-VF8UF4: %[[PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX4]], %middle.block ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF8UF4: ret float %[[PHI]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]			%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %0, %sum.07			%add = fadd float %0, %sum.07
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !1			br i1 %exitcond.not, label %for.end, label %for.body

	for.end:			for.end:
	ret float %add			ret float %add
	}			}

	; An additional test for unrolling where we need the last value of the reduction, i.e:			; An additional test for unrolling where we need the last value of the reduction, i.e:
	; float sum = 0, sum2;			; float sum = 0, sum2;
	; for(int i=0; i<N; ++i) {			; for(int i=0; i<N; ++i) {
	; sum += ptr[i];			; sum += ptr[i];
	; *ptr2 = sum + 42;			; *ptr2 = sum + 42;
	; }			; }
	; return sum;			; return sum;

				; CHECK-REMARKS: vectorized loop (vectorization width: 2, interleaved count: 2)
	define float @fadd_strict_unroll_last_val(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define float @fadd_strict_unroll_last_val(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_strict_unroll_last_val			; CHECK-VF8UF2-LABEL: @fadd_strict_unroll_last_val
	; CHECK: vector.body			; CHECK-VF8UF2: vector.body
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]			; CHECK-VF8UF2: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4:.*]], %vector.body ]
	; CHECK-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]			; CHECK-VF8UF2-NOT: phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX4]], %vector.body ]
	; CHECK: %[[LOAD1:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF2: %[[LOAD1:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD2:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF2: %[[LOAD2:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD3:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF2: %[[LOAD3:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[LOAD4:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF8UF2: %[[LOAD4:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI1]], <8 x float> %[[LOAD1]])			; CHECK-VF8UF2: %[[RDX1:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI1]], <8 x float> %[[LOAD1]])
	; CHECK: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX1]], <8 x float> %[[LOAD2]])			; CHECK-VF8UF2: %[[RDX2:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX1]], <8 x float> %[[LOAD2]])
	; CHECK: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX2]], <8 x float> %[[LOAD3]])			; CHECK-VF8UF2: %[[RDX3:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX2]], <8 x float> %[[LOAD3]])
	; CHECK: %[[RDX4]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX3]], <8 x float> %[[LOAD4]])			; CHECK-VF8UF2: %[[RDX4]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[RDX3]], <8 x float> %[[LOAD4]])
	; CHECK: for.body			; CHECK-VF8UF2: for.body
	; CHECK: %[[SUM_PHI:.*]] = phi float [ %[[FADD:.*]], %for.body ], [ {{.*}}, %scalar.ph ]			; CHECK-VF8UF2: %[[SUM_PHI:.*]] = phi float [ %[[FADD:.*]], %for.body ], [ {{.*}}, %scalar.ph ]
	; CHECK: %[[LOAD5:.*]] = load float, float*			; CHECK-VF8UF2: %[[LOAD5:.*]] = load float, float*
	; CHECK: %[[FADD]] = fadd float %[[SUM_PHI]], %[[LOAD5]]			; CHECK-VF8UF2: %[[FADD]] = fadd float %[[SUM_PHI]], %[[LOAD5]]
	; CHECK: for.cond.cleanup			; CHECK-VF8UF2: for.cond.cleanup
	; CHECK: %[[FADD_LCSSA:.*]] = phi float [ %[[FADD]], %for.body ], [ %[[RDX4]], %middle.block ]			; CHECK-VF8UF2: %[[FADD_LCSSA:.*]] = phi float [ %[[FADD]], %for.body ], [ %[[RDX4]], %middle.block ]
	; CHECK: %[[FADD_42:.*]] = fadd float %[[FADD_LCSSA]], 4.200000e+01			; CHECK-VF8UF2: %[[FADD_42:.*]] = fadd float %[[FADD_LCSSA]], 4.200000e+01
	; CHECK: store float %[[FADD_42]], float* %b			; CHECK-VF8UF2: store float %[[FADD_42]], float* %b
	; CHECK: for.end			; CHECK-VF8UF2: for.end
	; CHECK: %[[SUM_LCSSA:.*]] = phi float [ %[[FADD_LCSSA]], %for.cond.cleanup ], [ 0.000000e+00, %entry ]			; CHECK-VF8UF2: %[[SUM_LCSSA:.*]] = phi float [ %[[FADD_LCSSA]], %for.cond.cleanup ], [ 0.000000e+00, %entry ]
	; CHECK: ret float %[[SUM_LCSSA]]			; CHECK-VF8UF2: ret float %[[SUM_LCSSA]]
				david-arm: Hi @kmclaughlin, I think maybe this is meant to be CHECK-VF8UF4?
	entry:			entry:
	%cmp = icmp sgt i64 %n, 0			%cmp = icmp sgt i64 %n, 0
	br i1 %cmp, label %for.body, label %for.end			br i1 %cmp, label %for.body, label %for.end

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum = phi float [ 0.000000e+00, %entry ], [ %fadd, %for.body ]			%sum = phi float [ 0.000000e+00, %entry ], [ %fadd, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%fadd = fadd float %sum, %0			%fadd = fadd float %sum, %0
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !1			br i1 %exitcond.not, label %for.cond.cleanup, label %for.body

	for.cond.cleanup:			for.cond.cleanup:
	%fadd.lcssa = phi float [ %fadd, %for.body ]			%fadd.lcssa = phi float [ %fadd, %for.body ]
	%fadd2 = fadd float %fadd.lcssa, 4.200000e+01			%fadd2 = fadd float %fadd.lcssa, 4.200000e+01
	store float %fadd2, float* %b, align 4			store float %fadd2, float* %b, align 4
	br label %for.end			br label %for.end

	for.end:			for.end:
	%sum.lcssa = phi float [ %fadd.lcssa, %for.cond.cleanup ], [ 0.000000e+00, %entry ]			%sum.lcssa = phi float [ %fadd.lcssa, %for.cond.cleanup ], [ 0.000000e+00, %entry ]
	ret float %sum.lcssa			ret float %sum.lcssa
	}			}

				; CHECK-REMARKS: vectorized loop (vectorization width: 2, interleaved count: 2)
	define void @fadd_strict_interleave(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define void @fadd_strict_interleave(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_strict_interleave			; CHECK-VF4UF1-LABEL: @fadd_strict_interleave
	; CHECK: entry			; CHECK-VF4UF1: entry
	; CHECK: %[[ARRAYIDX:.*]] = getelementptr inbounds float, float* %a, i64 1			; CHECK-VF4UF1: %[[ARRAYIDX:.*]] = getelementptr inbounds float, float* %a, i64 1
	; CHECK: %[[LOAD1:.*]] = load float, float* %a			; CHECK-VF4UF1: %[[LOAD1:.*]] = load float, float* %a
	; CHECK: %[[LOAD2:.*]] = load float, float* %[[ARRAYIDX]]			; CHECK-VF4UF1: %[[LOAD2:.*]] = load float, float* %[[ARRAYIDX]]
	; CHECK: vector.body			; CHECK-VF4UF1: vector.body
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ %[[LOAD2]], %vector.ph ], [ %[[RDX2:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI1:.*]] = phi float [ %[[LOAD2]], %vector.ph ], [ %[[RDX2:.*]], %vector.body ]
	; CHECK: %[[VEC_PHI2:.*]] = phi float [ %[[LOAD1]], %vector.ph ], [ %[[RDX1:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI2:.*]] = phi float [ %[[LOAD1]], %vector.ph ], [ %[[RDX1:.*]], %vector.body ]
	; CHECK: %[[WIDE_LOAD:.*]] = load <8 x float>, <8 x float>*			; CHECK-VF4UF1: %[[WIDE_LOAD:.*]] = load <8 x float>, <8 x float>*
	; CHECK: %[[STRIDED1:.*]] = shufflevector <8 x float> %[[WIDE_LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK-VF4UF1: %[[STRIDED1:.*]] = shufflevector <8 x float> %[[WIDE_LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: %[[STRIDED2:.*]] = shufflevector <8 x float> %[[WIDE_LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK-VF4UF1: %[[STRIDED2:.*]] = shufflevector <8 x float> %[[WIDE_LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: %[[RDX1]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI2]], <4 x float> %[[STRIDED1]])			; CHECK-VF4UF1: %[[RDX1]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI2]], <4 x float> %[[STRIDED1]])
	; CHECK: %[[RDX2]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI1]], <4 x float> %[[STRIDED2]])			; CHECK-VF4UF1: %[[RDX2]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI1]], <4 x float> %[[STRIDED2]])
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK ret void			; CHECK-VF4UF1: ret void
	entry:			entry:
	%arrayidxa = getelementptr inbounds float, float* %a, i64 1			%arrayidxa = getelementptr inbounds float, float* %a, i64 1
	%a1 = load float, float* %a, align 4			%a1 = load float, float* %a, align 4
	%a2 = load float, float* %arrayidxa, align 4			%a2 = load float, float* %arrayidxa, align 4
	br label %for.body			br label %for.body

	for.body:			for.body:
	%add.phi1 = phi float [ %a2, %entry ], [ %add2, %for.body ]			%add.phi1 = phi float [ %a2, %entry ], [ %add2, %for.body ]
	%add.phi2 = phi float [ %a1, %entry ], [ %add1, %for.body ]			%add.phi2 = phi float [ %a1, %entry ], [ %add1, %for.body ]
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%arrayidxb1 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidxb1 = getelementptr inbounds float, float* %b, i64 %iv
	%0 = load float, float* %arrayidxb1, align 4			%0 = load float, float* %arrayidxb1, align 4
	%add1 = fadd float %0, %add.phi2			%add1 = fadd float %0, %add.phi2
	%or = or i64 %iv, 1			%or = or i64 %iv, 1
	%arrayidxb2 = getelementptr inbounds float, float* %b, i64 %or			%arrayidxb2 = getelementptr inbounds float, float* %b, i64 %or
	%1 = load float, float* %arrayidxb2, align 4			%1 = load float, float* %arrayidxb2, align 4
	%add2 = fadd float %1, %add.phi1			%add2 = fadd float %1, %add.phi1
	%iv.next = add nuw nsw i64 %iv, 2			%iv.next = add nuw nsw i64 %iv, 2
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body

	for.end:			for.end:
	store float %add1, float* %a, align 4			store float %add1, float* %a, align 4
	store float %add2, float* %arrayidxa, align 4			store float %add2, float* %arrayidxa, align 4
	ret void			ret void
	}			}

				; CHECK-REMARKS: vectorized loop (vectorization width: 4, interleaved count: 2)
	define float @fadd_invariant(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define float @fadd_invariant(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_invariant			; CHECK-VF4UF1-LABEL: @fadd_invariant
	; CHECK: vector.body			; CHECK-VF4UF1: vector.body
	; CHECK: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]			; CHECK-VF4UF1: %[[VEC_PHI1:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %vector.body ]
	; CHECK: %[[LOAD1:.*]] = load <4 x float>, <4 x float>*			; CHECK-VF4UF1: %[[LOAD1:.*]] = load <4 x float>, <4 x float>*
	; CHECK: %[[LOAD2:.*]] = load <4 x float>, <4 x float>*			; CHECK-VF4UF1: %[[LOAD2:.*]] = load <4 x float>, <4 x float>*
	; CHECK: %[[ADD:.*]] = fadd <4 x float> %[[LOAD1]], %[[LOAD2]]			; CHECK-VF4UF1: %[[ADD:.*]] = fadd <4 x float> %[[LOAD1]], %[[LOAD2]]
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI1]], <4 x float> %[[ADD]])			; CHECK-VF4UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI1]], <4 x float> %[[ADD]])
	; CHECK: for.end.loopexit			; CHECK-VF4UF1: for.end.loopexit
	; CHECK: %[[EXIT_PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF4UF1: %[[EXIT_PHI:.*]] = phi float [ %[[SCALAR:.*]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK: %[[PHI:.*]] = phi float [ 0.000000e+00, %entry ], [ %[[EXIT_PHI]], %for.end.loopexit ]			; CHECK-VF4UF1: %[[PHI:.*]] = phi float [ 0.000000e+00, %entry ], [ %[[EXIT_PHI]], %for.end.loopexit ]
	; CHECK: ret float %[[PHI]]			; CHECK-VF4UF1: ret float %[[PHI]]
	entry:			entry:
	%arrayidx = getelementptr inbounds float, float* %a, i64 1			%arrayidx = getelementptr inbounds float, float* %a, i64 1
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%cmp1 = fcmp ogt float %0, 5.000000e-01			%cmp1 = fcmp ogt float %0, 5.000000e-01
	br i1 %cmp1, label %for.body, label %for.end			br i1 %cmp1, label %for.body, label %for.end

	for.body: ; preds = %for.body			for.body: ; preds = %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%res.014 = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]			%res.014 = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]
	%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	%arrayidx4 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx4 = getelementptr inbounds float, float* %b, i64 %iv
	%2 = load float, float* %arrayidx4, align 4			%2 = load float, float* %arrayidx4, align 4
	%add = fadd float %1, %2			%add = fadd float %1, %2
	%rdx = fadd float %res.014, %add			%rdx = fadd float %res.014, %add
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body

	for.end: ; preds = %for.body, %entry			for.end: ; preds = %for.body, %entry
	%res = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]			%res = phi float [ 0.000000e+00, %entry ], [ %rdx, %for.body ]
	ret float %res			ret float %res
	}			}

				; CHECK-REMARKS: the cost-model indicates that vectorization is not beneficial
	define float @fadd_conditional(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {			define float @fadd_conditional(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
	; CHECK-LABEL: @fadd_conditional			; CHECK-VF4UF1-LABEL: @fadd_conditional
	; CHECK: vector.body:			; CHECK-VF4UF1: vector.body:
	; CHECK: %[[PHI:.*]] = phi float [ 1.000000e+00, %vector.ph ], [ %[[RDX:.*]], %pred.load.continue6 ]			; CHECK-VF4UF1: %[[PHI:.*]] = phi float [ 1.000000e+00, %vector.ph ], [ %[[RDX:.*]], %pred.load.continue6 ]
	; CHECK: %[[LOAD1:.*]] = load <4 x float>, <4 x float>*			; CHECK-VF4UF1: %[[LOAD1:.*]] = load <4 x float>, <4 x float>*
	; CHECK: %[[FCMP1:.*]] = fcmp une <4 x float> %[[LOAD1]], zeroinitializer			; CHECK-VF4UF1: %[[FCMP1:.*]] = fcmp une <4 x float> %[[LOAD1]], zeroinitializer
	; CHECK: %[[EXTRACT:.*]] = extractelement <4 x i1> %[[FCMP1]], i32 0			; CHECK-VF4UF1: %[[EXTRACT:.*]] = extractelement <4 x i1> %[[FCMP1]], i32 0
	; CHECK: br i1 %[[EXTRACT]], label %pred.load.if, label %pred.load.continue			; CHECK-VF4UF1: br i1 %[[EXTRACT]], label %pred.load.if, label %pred.load.continue
	; CHECK: pred.load.continue6			; CHECK-VF4UF1: pred.load.continue6
	; CHECK: %[[PHI1:.*]] = phi <4 x float> [ %[[PHI0:.*]], %pred.load.continue4 ], [ %[[INS_ELT:.*]], %pred.load.if5 ]			; CHECK-VF4UF1: %[[PHI1:.*]] = phi <4 x float> [ %[[PHI0:.*]], %pred.load.continue4 ], [ %[[INS_ELT:.*]], %pred.load.if5 ]
	; CHECK: %[[XOR:.*]] = xor <4 x i1> %[[FCMP1]], <i1 true, i1 true, i1 true, i1 true>			; CHECK-VF4UF1: %[[XOR:.*]] = xor <4 x i1> %[[FCMP1]], <i1 true, i1 true, i1 true, i1 true>
	; CHECK: %[[PRED:.*]] = select <4 x i1> %[[XOR]], <4 x float> <float 3.000000e+00, float 3.000000e+00, float 3.000000e+00, float 3.000000e+00>, <4 x float> %[[PHI1]]			; CHECK-VF4UF1: %[[PRED:.*]] = select <4 x i1> %[[XOR]], <4 x float> <float 3.000000e+00, float 3.000000e+00, float 3.000000e+00, float 3.000000e+00>, <4 x float> %[[PHI1]]
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[PHI]], <4 x float> %[[PRED]])			; CHECK-VF4UF1: %[[RDX]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[PHI]], <4 x float> %[[PRED]])
	; CHECK: for.body			; CHECK-VF4UF1: for.body
	; CHECK: %[[RES_PHI:.*]] = phi float [ %[[MERGE_RDX:.*]], %scalar.ph ], [ %[[FADD:.*]], %for.inc ]			; CHECK-VF4UF1: %[[RES_PHI:.*]] = phi float [ %[[MERGE_RDX:.*]], %scalar.ph ], [ %[[FADD:.*]], %for.inc ]
	; CHECK: %[[LOAD2:.*]] = load float, float*			; CHECK-VF4UF1: %[[LOAD2:.*]] = load float, float*
	; CHECK: %[[FCMP2:.*]] = fcmp une float %[[LOAD2]], 0.000000e+00			; CHECK-VF4UF1: %[[FCMP2:.*]] = fcmp une float %[[LOAD2]], 0.000000e+00
	; CHECK: br i1 %[[FCMP2]], label %if.then, label %for.inc			; CHECK-VF4UF1: br i1 %[[FCMP2]], label %if.then, label %for.inc
	; CHECK: if.then			; CHECK-VF4UF1: if.then
	; CHECK: %[[LOAD3:.*]] = load float, float*			; CHECK-VF4UF1: %[[LOAD3:.*]] = load float, float*
	; CHECK: br label %for.inc			; CHECK-VF4UF1: br label %for.inc
	; CHECK: for.inc			; CHECK-VF4UF1: for.inc
	; CHECK: %[[PHI2:.*]] = phi float [ %[[LOAD3]], %if.then ], [ 3.000000e+00, %for.body ]			; CHECK-VF4UF1: %[[PHI2:.*]] = phi float [ %[[LOAD3]], %if.then ], [ 3.000000e+00, %for.body ]
	; CHECK: %[[FADD]] = fadd float %[[RES_PHI]], %[[PHI2]]			; CHECK-VF4UF1: %[[FADD]] = fadd float %[[RES_PHI]], %[[PHI2]]
	; CHECK: for.end			; CHECK-VF4UF1: for.end
	; CHECK: %[[RDX_PHI:.*]] = phi float [ %[[FADD]], %for.inc ], [ %[[RDX]], %middle.block ]			; CHECK-VF4UF1: %[[RDX_PHI:.*]] = phi float [ %[[FADD]], %for.inc ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[RDX_PHI]]			; CHECK-VF4UF1: ret float %[[RDX_PHI]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body			for.body: ; preds = %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]
	%res = phi float [ 1.000000e+00, %entry ], [ %fadd, %for.inc ]			%res = phi float [ 1.000000e+00, %entry ], [ %fadd, %for.inc ]
	%arrayidx = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx = getelementptr inbounds float, float* %b, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%tobool = fcmp une float %0, 0.000000e+00			%tobool = fcmp une float %0, 0.000000e+00
	br i1 %tobool, label %if.then, label %for.inc			br i1 %tobool, label %if.then, label %for.inc

	if.then: ; preds = %for.body			if.then: ; preds = %for.body
	%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %a, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	br label %for.inc			br label %for.inc

	for.inc:			for.inc:
	%phi = phi float [ %1, %if.then ], [ 3.000000e+00, %for.body ]			%phi = phi float [ %1, %if.then ], [ 3.000000e+00, %for.body ]
	%fadd = fadd float %res, %phi			%fadd = fadd float %res, %phi
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !2			br i1 %exitcond.not, label %for.end, label %for.body

	for.end:			for.end:
	%rdx = phi float [ %fadd, %for.inc ]			%rdx = phi float [ %fadd, %for.inc ]
	ret float %rdx			ret float %rdx
	}			}

	; Test to check masking correct, using the "llvm.loop.vectorize.predicate.enable" attribute			; Test to check masking correct, using the "llvm.loop.vectorize.predicate.enable" attribute
				; CHECK-REMARKS: interleaved loop (interleaved count: 2)
	define float @fadd_predicated(float* noalias nocapture %a, i64 %n) {			define float @fadd_predicated(float* noalias nocapture %a, i64 %n) {
	; CHECK-LABEL: @fadd_predicated			; CHECK-PRED-LABEL: @fadd_predicated
	; CHECK: vector.ph			; CHECK-PRED: vector.ph
	; CHECK: %[[TRIP_MINUS_ONE:.*]] = sub i64 %n, 1			; CHECK-PRED: %[[TRIP_MINUS_ONE:.*]] = sub i64 %n, 1
	; CHECK: %[[BROADCAST_INS:.*]] = insertelement <2 x i64> poison, i64 %[[TRIP_MINUS_ONE]], i32 0			; CHECK-PRED: %[[BROADCAST_INS:.*]] = insertelement <2 x i64> poison, i64 %[[TRIP_MINUS_ONE]], i32 0
	; CHECK: %[[SPLAT:.*]] = shufflevector <2 x i64> %[[BROADCAST_INS]], <2 x i64> poison, <2 x i32> zeroinitializer			; CHECK-PRED: %[[SPLAT:.*]] = shufflevector <2 x i64> %[[BROADCAST_INS]], <2 x i64> poison, <2 x i32> zeroinitializer
	; CHECK: vector.body			; CHECK-PRED: vector.body
	; CHECK: %[[RDX_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %pred.load.continue2 ]			; CHECK-PRED: %[[RDX_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[RDX:.*]], %pred.load.continue2 ]
	; CHECK: pred.load.continue2			; CHECK-PRED: pred.load.continue2
	; CHECK: %[[PHI:.*]] = phi <2 x float> [ %[[PHI0:.*]], %pred.load.continue ], [ %[[INS_ELT:.*]], %pred.load.if1 ]			; CHECK-PRED: %[[PHI:.*]] = phi <2 x float> [ %[[PHI0:.*]], %pred.load.continue ], [ %[[INS_ELT:.*]], %pred.load.if1 ]
	; CHECK: %[[MASK:.*]] = select <2 x i1> %0, <2 x float> %[[PHI]], <2 x float> <float -0.000000e+00, float -0.000000e+00>			; CHECK-PRED: %[[MASK:.*]] = select <2 x i1> %0, <2 x float> %[[PHI]], <2 x float> <float -0.000000e+00, float -0.000000e+00>
	; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.v2f32(float %[[RDX_PHI]], <2 x float> %[[MASK]])			; CHECK-PRED: %[[RDX]] = call float @llvm.vector.reduce.fadd.v2f32(float %[[RDX_PHI]], <2 x float> %[[MASK]])
	; CHECK: for.end:			; CHECK-PRED: for.end:
	; CHECK: %[[RES_PHI:.*]] = phi float [ %[[FADD:.*]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-PRED: %[[RES_PHI:.*]] = phi float [ %[[FADD:.*]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[RES_PHI]]			; CHECK-PRED: ret float %[[RES_PHI]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]			%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]
	%sum.02 = phi float [ %l7, %for.body ], [ 0.000000e+00, %entry ]			%sum.02 = phi float [ %l7, %for.body ], [ 0.000000e+00, %entry ]
	%l2 = getelementptr inbounds float, float* %a, i64 %iv			%l2 = getelementptr inbounds float, float* %a, i64 %iv
	%l3 = load float, float* %l2, align 4			%l3 = load float, float* %l2, align 4
	%l7 = fadd float %sum.02, %l3			%l7 = fadd float %sum.02, %l3
	%iv.next = add i64 %iv, 1			%iv.next = add i64 %iv, 1
	%exitcond = icmp eq i64 %iv.next, %n			%exitcond = icmp eq i64 %iv.next, %n
	br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !3			br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !0

	for.end: ; preds = %for.body			for.end: ; preds = %for.body
	%sum.0.lcssa = phi float [ %l7, %for.body ]			%sum.0.lcssa = phi float [ %l7, %for.body ]
	ret float %sum.0.lcssa			ret float %sum.0.lcssa
	}			}

				; Test we can vectorize a loop both an FAdd which we can vectorize inloop and an integer add
				; CHECK-REMARKS: vectorized loop (vectorization width: 2, interleaved count: 2)
				define float @fadd_mixed(float* noalias nocapture readonly %a, i64* noalias nocapture readonly %b, i64 %n) {
				; CHECK-LABEL-VF4UF1: @fadd_mixed
				; CHECK-VF4UF1: vector.body:
				; CHECK-VF4UF1: %[[VEC_PHI_F32:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[FADD_RDX:.*]], %vector.body ]
				; CHECK-VF4UF1: %[[VEC_PHI_INT:.*]] = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %[[ADD:.*]], %vector.body ]
				; CHECK-VF4UF1: %[[LOAD:.*]] = load <4 x float>, <4 x float>*
				; CHECK-VF4UF1: %[[FADD_RDX]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[VEC_PHI_F32]], <4 x float> %[[LOAD]])
				; CHECK-VF4UF1: %[[ADD]] = add <4 x i64> %vec.ind, %[[VEC_PHI_INT]]
				; CHECK-VF4UF1: middle.block
				; CHECK-VF4UF1: %[[ADD_RDX:.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %[[ADD]])
				; CHECK-VF4UF1: for.end
				; CHECK-VF4UF1: %[[FADD_PHI:.*]] = phi float [ %[[SCALAR_FADD:.*]], %for.body ], [ %[[FADD_RDX]], %middle.block ]
				; CHECK-VF4UF1: %[[ADD_PHI:.*]] = phi i64 [ %[[SCALAR_ADD:.*]], %for.body ], [ %[[ADD_RDX]], %middle.block ]
				; CHECK-VF4UF1: store i64 %[[ADD_PHI]], i64* %b
				; CHECK-VF4UF1: ret float %[[FADD_PHI]]
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.fadd = phi float [ 0.000000e+00, %entry ], [ %fadd, %for.body ]
				%sum.add = phi i64 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
				%0 = load float, float* %arrayidx, align 4
				%fadd = fadd float %0, %sum.fadd
				%add = add i64 %iv, %sum.add
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				store i64 %add, i64* %b, align 4
				ret float %fadd
				}

	; Negative test - loop contains multiple fadds which we cannot safely reorder			; Negative test - loop contains multiple fadds which we cannot safely reorder
				; CHECK-REMARKS: loop not vectorized: cannot prove it is safe to reorder floating-point operations
	define float @fadd_multiple(float* noalias nocapture %a, float* noalias nocapture %b, i64 %n) {			define float @fadd_multiple(float* noalias nocapture %a, float* noalias nocapture %b, i64 %n) {
	; CHECK-LABEL: @fadd_multiple			; CHECK-VF8UF1-LABEL: @fadd_multiple
	; CHECK: vector.body			; CHECK-VF8UF1: vector.body
	; CHECK: %[[PHI:.*]] = phi <8 x float> [ <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %vector.ph ], [ %[[VEC_FADD2:.*]], %vector.body ]			; CHECK-VF8UF1: %[[PHI:.*]] = phi <8 x float> [ <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %vector.ph ], [ %[[VEC_FADD2:.*]], %vector.body ]
	; CHECK: %[[VEC_LOAD1:.*]] = load <8 x float>, <8 x float>			; CHECK-VF8UF1: %[[VEC_LOAD1:.*]] = load <8 x float>, <8 x float>
	; CHECK: %[[VEC_FADD1:.*]] = fadd <8 x float> %[[PHI]], %[[VEC_LOAD1]]			; CHECK-VF8UF1: %[[VEC_FADD1:.*]] = fadd <8 x float> %[[PHI]], %[[VEC_LOAD1]]
	; CHECK: %[[VEC_LOAD2:.*]] = load <8 x float>, <8 x float>			; CHECK-VF8UF1: %[[VEC_LOAD2:.*]] = load <8 x float>, <8 x float>
	; CHECK: %[[VEC_FADD2]] = fadd <8 x float> %[[VEC_FADD1]], %[[VEC_LOAD2]]			; CHECK-VF8UF1: %[[VEC_FADD2]] = fadd <8 x float> %[[VEC_FADD1]], %[[VEC_LOAD2]]
	; CHECK: middle.block			; CHECK-VF8UF1: middle.block
	; CHECK: %[[RDX:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> %[[VEC_FADD2]])			; CHECK-VF8UF1: %[[RDX:.*]] = call float @llvm.vector.reduce.fadd.v8f32(float -0.000000e+00, <8 x float> %[[VEC_FADD2]])
	; CHECK: for.body			; CHECK-VF8UF1: for.body
	; CHECK: %[[SUM:.*]] = phi float [ %bc.merge.rdx, %scalar.ph ], [ %[[FADD2:.*]], %for.body ]			; CHECK-VF8UF1: %[[SUM:.*]] = phi float [ %bc.merge.rdx, %scalar.ph ], [ %[[FADD2:.*]], %for.body ]
	; CHECK: %[[LOAD1:.*]] = load float, float*			; CHECK-VF8UF1: %[[LOAD1:.*]] = load float, float*
	; CHECK: %[[FADD1:.*]] = fadd float %sum, %[[LOAD1]]			; CHECK-VF8UF1: %[[FADD1:.*]] = fadd float %sum, %[[LOAD1]]
	; CHECK: %[[LOAD2:.*]] = load float, float*			; CHECK-VF8UF1: %[[LOAD2:.*]] = load float, float*
	; CHECK: %[[FADD2]] = fadd float %[[FADD1]], %[[LOAD2]]			; CHECK-VF8UF1: %[[FADD2]] = fadd float %[[FADD1]], %[[LOAD2]]
	; CHECK: for.end			; CHECK-VF8UF1: for.end
	; CHECK: %[[RET:.*]] = phi float [ %[[FADD2]], %for.body ], [ %[[RDX]], %middle.block ]			; CHECK-VF8UF1: %[[RET:.*]] = phi float [ %[[FADD2]], %for.body ], [ %[[RDX]], %middle.block ]
	; CHECK: ret float %[[RET]]			; CHECK-VF8UF1: ret float %[[RET]]
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum = phi float [ -0.000000e+00, %entry ], [ %add3, %for.body ]			%sum = phi float [ -0.000000e+00, %entry ], [ %add3, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%add = fadd float %sum, %0			%add = fadd float %sum, %0
	%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv			%arrayidx2 = getelementptr inbounds float, float* %b, i64 %iv
	%1 = load float, float* %arrayidx2, align 4			%1 = load float, float* %arrayidx2, align 4
	%add3 = fadd float %add, %1			%add3 = fadd float %add, %1
	%iv.next = add nuw nsw i64 %iv, 1			%iv.next = add nuw nsw i64 %iv, 1
	%exitcond.not = icmp eq i64 %iv.next, %n			%exitcond.not = icmp eq i64 %iv.next, %n
	br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0			br i1 %exitcond.not, label %for.end, label %for.body

	for.end: ; preds = %for.body			for.end: ; preds = %for.body
	%rdx = phi float [ %add3, %for.body ]			%rdx = phi float [ %add3, %for.body ]
	ret float %rdx			ret float %rdx
	}			}

	!0 = distinct !{!0, !4, !7, !9}			; Tests with both a floating point reduction & induction, e.g.
	!1 = distinct !{!1, !4, !8, !9}			;
	!2 = distinct !{!2, !5, !7, !9}			;float fp_iv_rdx_loop(float *values, float init, float* __restrict__ A, int N) {
	!3 = distinct !{!3, !6, !7, !9, !10}			; float fp_inc = 2.0;
	!4 = !{!"llvm.loop.vectorize.width", i32 8}			; float x = init;
	!5 = !{!"llvm.loop.vectorize.width", i32 4}			; float sum = 0.0;
	!6 = !{!"llvm.loop.vectorize.width", i32 2}			; for (int i=0; i < N; ++i) {
	!7 = !{!"llvm.loop.interleave.count", i32 1}			; A[i] = x;
	!8 = !{!"llvm.loop.interleave.count", i32 4}			; x += fp_inc;
	!9 = !{!"llvm.loop.vectorize.enable", i1 true}			; sum += values[i];
	!10 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}			; }
				; return sum;
				;}

				; Test which includes a reduction which could be performed in-loop, but which also has an FP induction
				; variable which cannot be vectorized.
				sdesmalen (Done): nit: ; Strict reduction could be performed in-loop, but ordered FP induction variables are not supported.
				; CHECK-REMARKS: loop not vectorized: cannot prove it is safe to reorder floating-point operations
				define float @induction_and_reduction(float* nocapture readonly %values, float %init, float* noalias nocapture %A, i64 %N) {
				; CHECK-NO-HINTS-LABEL: @induction_and_reduction
				; CHECK-NO-HINTS-NOT: vector.body
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.015 = phi float [ 0.000000e+00, %entry ], [ %add3, %for.body ]
				%x.014 = phi float [ %init, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %A, i64 %iv
				store float %x.014, float* %arrayidx, align 4
				%add = fadd float %x.014, 2.000000e+00
				%arrayidx2 = getelementptr inbounds float, float* %values, i64 %iv
				%0 = load float, float* %arrayidx2, align 4
				%add3 = fadd float %sum.015, %0
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %N
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				%add3.lcssa = phi float [ %add3, %for.body ]
				sdesmalen (Done): nit: this PHI is unnecessary? (same for the tests below)
				ret float %add3.lcssa
				}

				; As above, but the floating-point induction is 'fast'
				sdesmalen (Done): Can you be more explicit in the comment? i.e. ; As above, but with the FP induction being unordered (fast), the loop can be vectorized.
				; CHECK-REMARKS: vectorized loop (vectorization width: 4, interleaved count: 2)
				define float @fast_induction_and_reduction(float* nocapture readonly %values, float %init, float* noalias nocapture %A, i64 %N) {
				; CHECK-NO-HINTS-LABEL: @fast_induction_and_reduction
				; CHECK-NO-HINTS: vector.ph
				; CHECK-NO-HINTS: %[[INDUCTION:.*]] = fadd fast <4 x float> {{.*}}, <float 0.000000e+00, float 2.000000e+00, float 4.000000e+00, float 6.000000e+00>
				; CHECK-NO-HINTS: vector.body
				; CHECK-NO-HINTS: %[[RDX_PHI:.*]] = phi float [ 0.000000e+00, %vector.ph ], [ %[[FADD2:.*]], %vector.body ]
				; CHECK-NO-HINTS: %[[IND_PHI:.*]] = phi <4 x float> [ %[[INDUCTION]], %vector.ph ], [ %[[VEC_IND_NEXT:.*]], %vector.body ]
				; CHECK-NO-HINTS: %[[STEP_ADD:.*]] = fadd fast <4 x float> %[[IND_PHI]], <float 8.000000e+00, float 8.000000e+00, float 8.000000e+00, float 8.000000e+00>
				; CHECK-NO-HINTS: %[[LOAD1:.*]] = load <4 x float>, <4 x float>*
				; CHECK-NO-HINTS: %[[LOAD2:.*]] = load <4 x float>, <4 x float>*
				; CHECK-NO-HINTS: %[[FADD1:.*]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[RDX_PHI]], <4 x float> %[[LOAD1]])
				; CHECK-NO-HINTS: %[[FADD2]] = call float @llvm.vector.reduce.fadd.v4f32(float %[[FADD1]], <4 x float> %[[LOAD2]])
				; CHECK-NO-HINTS: %[[VEC_IND_NEXT]] = fadd fast <4 x float> %[[STEP_ADD]], <float 8.000000e+00, float 8.000000e+00, float 8.000000e+00, float 8.000000e+00>
				; CHECK-NO-HINTS: for.body
				; CHECK-NO-HINTS: %[[RDX_SUM_PHI:.*]] = phi float [ {{.*}}, %scalar.ph ], [ %[[FADD3:.*]], %for.body ]
				; CHECK-NO-HINTS: %[[IND_SUM_PHI:.*]] = phi fast float [ {{.*}}, %scalar.ph ], [ %[[ADD_IND:.*]], %for.body ]
				; CHECK-NO-HINTS: store float %[[IND_SUM_PHI]], float*
				; CHECK-NO-HINTS: %[[ADD_IND]] = fadd fast float %[[IND_SUM_PHI]], 2.000000e+00
				; CHECK-NO-HINTS: %[[LOAD3:.*]] = load float, float*
				; CHECK-NO-HINTS: %[[FADD3]] = fadd float %[[RDX_SUM_PHI]], %[[LOAD3]]
				; CHECK-NO-HINTS: for.end
				; CHECK-NO-HINTS: %[[RES_PHI:.*]] = phi float [ %[[FADD3]], %for.body ], [ %[[FADD2]], %middle.block ]
				; CHECK-NO-HINTS: ret float %[[RES_PHI]]
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum.015 = phi float [ 0.000000e+00, %entry ], [ %add3, %for.body ]
				%x.014 = phi fast float [ %init, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %A, i64 %iv
				store float %x.014, float* %arrayidx, align 4
				%add = fadd fast float %x.014, 2.000000e+00
				%arrayidx2 = getelementptr inbounds float, float* %values, i64 %iv
				%0 = load float, float* %arrayidx2, align 4
				%add3 = fadd float %sum.015, %0
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %N
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				%add3.lcssa = phi float [ %add3, %for.body ]
				ret float %add3.lcssa
				}

				; The FP induction is fast, but here we can't vectorize as only one of the reductions is an FAdd that can be performed in-loop
				; CHECK-REMARKS: loop not vectorized: cannot prove it is safe to reorder floating-point operations
				define float @fast_induction_unordered_reduction(float* nocapture readonly %values, float %init, float* noalias nocapture %A, float* noalias nocapture %B, i64 %N) {
				; CHECK-LABEL: @fast_induction_unordered_reduction
				david-arm (Not Done): Are these new tests missing hints that the other tests seem to use? I just wondered if it was better to be consistent here that's all. The reason I mention this is because I was expecting the UNORDERED case to vectorise due to the `-hints-allow-reordering=true` flag.
				david-arm (Not Done): I think I see now @kmclaughlin - you're testing the productisation of `-enable-strict-reductions` so you were adding some tests deliberately without hints, which also makes sense. In this case I'd also be happy if you left these tests as they are and just added some comments explaining why we expect the CHECK-UNORDERED case to not vectorise.
				kmclaughlin (Author, Done): Hi @david-arm, I've added a comment above these tests to explain why the CHECK-UNORDERED case shouldn't vectorize.
				; CHECK-NOT: vector.body
				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%sum2.023 = phi float [ 3.000000e+00, %entry ], [ %mul, %for.body ]
				%sum.022 = phi float [ 0.000000e+00, %entry ], [ %add3, %for.body ]
				%x.021 = phi float [ %init, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %A, i64 %iv
				store float %x.021, float* %arrayidx, align 4
				%add = fadd fast float %x.021, 2.000000e+00
				%arrayidx2 = getelementptr inbounds float, float* %values, i64 %iv
				%0 = load float, float* %arrayidx2, align 4
				%add3 = fadd float %sum.022, %0
				%mul = fmul float %sum2.023, %0
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond.not = icmp eq i64 %iv.next, %N
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end:
				%add3.lcssa = phi float [ %add3, %for.body ]
				%mul.lcssa = phi float [ %mul, %for.body ]
				%add6 = fadd float %add3.lcssa, %mul.lcssa
				ret float %add6

				}

				!0 = distinct !{!0, !1}
				!1 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
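
Note on ordered versus reordered reductions: the tests above distinguish an in-loop (strict) fadd reduction, which keeps a scalar accumulator and folds each vector in element order, from a reordered reduction, which keeps one partial sum per lane and combines them after the loop. The C++ below is a minimal sketch of that difference and is not code from this patch. The names scalar_sum, ordered_vector_sum and reordered_vector_sum are hypothetical, the inner lane loops only model what a vector of width VF would do, and the sketch assumes IEEE-754 single precision compiled without fast-math.

```cpp
#include <array>
#include <cstddef>

// Scalar loop: the evaluation order the source program specifies.
float scalar_sum(const float *a, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

// Models an ordered (strict) in-loop reduction: one scalar accumulator,
// and each VF-wide chunk is folded into it lane by lane, in order.
template <std::size_t VF>
float ordered_vector_sum(const float *a, std::size_t n) {
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane)
      acc += a[i + lane];
  for (; i < n; ++i) // scalar remainder loop
    acc += a[i];
  return acc;
}

// Models a reordered reduction: VF independent partial sums that are only
// combined after the loop, which changes the order of the additions.
template <std::size_t VF>
float reordered_vector_sum(const float *a, std::size_t n) {
  std::array<float, VF> partial{}; // value-initialised to 0.0f
  std::size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane)
      partial[lane] += a[i + lane];
  float sum = 0.0f;
  for (float p : partial) // horizontal reduction after the loop
    sum += p;
  for (; i < n; ++i) // scalar remainder loop
    sum += a[i];
  return sum;
}
```

Under the stated assumptions, the input {1e8f, 1.0f, -1e8f, 1.0f} with VF = 2 gives 1.0f from both scalar_sum and ordered_vector_sum, but 2.0f from reordered_vector_sum, because 1e8f + 1.0f rounds back to 1e8f in single precision. A result change of this kind is why reordering needs explicit permission.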

Revision Contents

Diff 344786

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
