This is an archive of the discontinued LLVM Phabricator instance.

Differential D95363

[SVE][LoopVectorize] Add support for scalable vectorization of loops with vector reverse
ClosedPublic

Authored by CarolineConcatto on Jan 25 2021, 7:22 AM.

Details

Reviewers

sdesmalen
fhahn
efriedma
frasercrmck
david-arm
kmclaughlin
c-rhodes

Commits

rG3c03635d5300: [SVE][LoopVectorize] Add support for scalable vectorization of loops with…

Summary

This patch adds support for reverse loop vectorization.
It is possible to vectorize the following loop:

for (int i = n-1; i >= 0; --i)
  a[i] = b[i] + 1.0;

with fixed or scalable vectors.
The loop vectorizer uses 'reverse' on the loads/stores to make
sure the lanes themselves are also handled in the right order.
This patch adds support for scalable vectors to the IRBuilder interface for
creating a reversed vector. The IRBuilder function
CreateVectorReverse lowers to experimental.vector.reverse for scalable vectors
and keeps the original behavior for fixed vectors, using a reverse shuffle.

Depends on D94883

Depends on D95603

Depends on D95139
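
To illustrate why the loads and stores need a 'reverse', here is a minimal sketch in plain C++ that models the vectorized loop from the summary with a fixed width. It is not the vectorizer's code: `reverseLoopScalar`, `reverseLoopVectorModel` and the `VF` parameter are hypothetical names chosen for this illustration, and `std::vector` stands in for a vector register.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: the loop from the summary.
void reverseLoopScalar(std::vector<double> &a, const std::vector<double> &b) {
  for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i)
    a[i] = b[i] + 1.0;
}

// Hypothetical model of the vectorized loop with a fixed width VF.
// Each wide iteration covers the VF elements that end just below index i.
void reverseLoopVectorModel(std::vector<double> &a,
                            const std::vector<double> &b, std::size_t VF) {
  assert(VF > 0 && a.size() == b.size());
  std::size_t i = a.size(); // one past the highest unprocessed index
  while (i >= VF) {
    // The wide access starts at the lowest address of the group: i - VF.
    auto Base = static_cast<std::ptrdiff_t>(i - VF);
    std::vector<double> Wide(b.begin() + Base, b.begin() + Base + static_cast<std::ptrdiff_t>(VF));
    // Reverse after the load so that lane 0 holds b[i - 1] (iteration order).
    std::reverse(Wide.begin(), Wide.end());
    for (double &X : Wide)
      X += 1.0;
    // Reverse again before the store so lanes are back in memory order.
    std::reverse(Wide.begin(), Wide.end());
    std::copy(Wide.begin(), Wide.end(), a.begin() + Base);
    i -= VF;
  }
  // Scalar remainder for the lowest indices.
  while (i > 0) {
    --i;
    a[i] = b[i] + 1.0;
  }
}
```

For an element-wise operation like this one the two reversals cancel out, but they matter as soon as lane order is observable, for example when a mask computed in iteration order has to line up with the stored lanes.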

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

CarolineConcatto created this revision. Jan 25 2021, 7:22 AM

Herald added subscribers: NickHung, hiraditya, kristof.beyls, tschuett. Jan 25 2021, 7:22 AM

CarolineConcatto requested review of this revision. Jan 25 2021, 7:22 AM

Herald added a project: Restricted Project. Jan 25 2021, 7:22 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B86566: Diff 319005. Jan 25 2021, 7:22 AM

CarolineConcatto retitled this revision from [SVELoopVectorize] Add support for scalable vectorization of loops with vector reverse to [SVE][LoopVectorize] Add support for scalable vectorization of loops with vector reverse. Jan 25 2021, 7:27 AM

CarolineConcatto edited the summary of this revision.

CarolineConcatto added reviewers: sdesmalen, fhahn, efriedma, frasercrmck, david-arm, kmclaughlin, c-rhodes.

Herald added a subscriber: psnobl. Jan 25 2021, 7:27 AM

CarolineConcatto updated this revision to Diff 319010. Jan 25 2021, 7:31 AM

-s/scalabe/scalable

CarolineConcatto edited the summary of this revision. Jan 25 2021, 7:31 AM

Harbormaster completed remote builds in B86570: Diff 319010. Jan 25 2021, 7:32 AM

CarolineConcatto edited the summary of this revision. Jan 25 2021, 7:49 AM

CarolineConcatto edited the summary of this revision.

CarolineConcatto edited the summary of this revision. Jan 25 2021, 7:52 AM

CarolineConcatto added a parent revision: D94883: [CodeGen][SelectionDAG]Add new intrinsic experimental.vector.reverse.

david-arm added inline comments. Jan 25 2021, 8:45 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
451 ↗	(On Diff #319010)	Hi Carol, is the reason why you bail out here because we are casting between at least one illegal type? If so, perhaps you could add a TODO here, something like: // TODO : For scalable vectors we avoid calling the BaseT version for now because it doesn't yet work for illegal vector types as it assumes these are fixed width.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2853	Hi Carol. I don't think this is right here. The number of cached lanes in my patch (D95139) simply refers to the number of values we have managed to cache. It's just an optimisation the vectoriser uses to avoid constantly calculating the same lane index expressions. However, what you're after here is a runtime calculation for the last lane index - you can use getRuntimeVF in my patch for this. I think the code below will probably need splitting out into fixed width vs scalable cases, i.e. Builder.getInt32(1 - NumElts) would need to be something like: Builder.CreateSub(Builder.getInt32(1), getRuntimeVF()) for scalable vectors. Hope that makes sense!

sdesmalen added inline comments. Jan 25 2021, 8:46 AM

llvm/include/llvm/IR/IRBuilder.h
2452	It would be nice if the IRBuilder interface supports both the scalable and fixed-width case, i.e. for scalable vectors it creates the intrinsic, for fixed-width vectors it creates a shufflevector. Then in the LoopVectorizer it can just call `CreateVectorReverse(Vec)` and get the same behaviour.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
451–456 ↗	(On Diff #319010)	I don't think you should return just some random value here. Is there anything broken in BasicTTIImpl that needs fixing? If you run into any asserts caused by the cast from FP -> Int, you can simplify the test so that it doesn't need that cast (and then fix this up in a separate patch).
1190 ↗	(On Diff #319010)	Can you instead add cases to the table? e.g. { TTI::SK_Reverse, MVT::nxv16i8, 1 }, { TTI::SK_Reverse, MVT::nxv8i16, 1 }, ...
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2852–2853	`NumElts` can't be `unsigned` here, because it has to be a value that's evaluated at runtime. This means it needs to multiply `VF.getKnownMinValue` by the runtime value of VScale, making `NumElts` a `Value *`. For the first GEP you can also do the arithmetic on a pointer of the vector type, and index by `-Part`, e.g.
  getelementptr <vscale x 8 x double>, <vscale x 8 x double>* %ptr, i64 -<part>
<=>
  %vscale = call i64 @llvm.vscale()
  %part = mul i64 %vscale, 8
  getelementptr double, double* %ptr, i64 %part
The second GEP indexes a specific element, so that would still need to be a getelementptr on a `double`.
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
24	It is difficult to verify what this code is generating, and the GEPs here are actually really important to get right, so can you just check for explicit values?

fhahn added inline comments. Jan 25 2021, 8:58 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1191 ↗	(On Diff #319010)	Also, can this have a separate cost model test? If so, can this be split off as a separate improvement?

Move cost function for Shuffle reverse to D93639
Use getRuntimeVF to compute the runtime vector size for scalable vectors
Add tests for fixed vectors

Harbormaster completed remote builds in B88222: Diff 321998. Feb 7 2021, 10:42 AM

CarolineConcatto edited the summary of this revision. Feb 7 2021, 10:42 AM

CarolineConcatto added a parent revision: D95603: [CostModel]Add cost model for experimental.vector.reverse.

CarolineConcatto edited the summary of this revision.

Thank you all for the review.
I believe I have addressed your comments.
I've created another patch to compute the cost of vector reverse for scalable vectors.
I've removed the cast from int to double in one of the tests, so this feature no longer depends on that cast being fixed.
The IRBuilder now has a reverse function for both fixed and scalable vectors.
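
As a small illustration of the fixed-width half of that reverse function, the sketch below builds the reverse mask the same way the patch's `CreateVectorReverse` does (`NumElts - i - 1`) and applies it to a plain `std::vector`. The names `makeReverseMask` and `applyShuffle` are hypothetical helpers for this example only; they are not LLVM APIs, and `applyShuffle` is only a stand-in for a unary shufflevector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the reverse mask for a fixed-width vector: lane i of the result
// takes lane (NumElts - i - 1) of the input.
std::vector<int> makeReverseMask(int NumElts) {
  std::vector<int> Mask;
  for (int i = 0; i < NumElts; ++i)
    Mask.push_back(NumElts - i - 1);
  return Mask;
}

// Hypothetical stand-in for a unary shufflevector: result lane i is
// V[Mask[i]].
template <typename T>
std::vector<T> applyShuffle(const std::vector<T> &V,
                            const std::vector<int> &Mask) {
  std::vector<T> Out;
  for (int Idx : Mask) {
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < V.size());
    Out.push_back(V[static_cast<std::size_t>(Idx)]);
  }
  return Out;
}
```

A constant mask like this can only be written down when the lane count is known at compile time, which is why the scalable case goes through the experimental.vector.reverse intrinsic instead.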

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
24	I don't know if I understood what you want. I had to change the checks. Would you mind showing what you would like to see checked in these tests?

david-arm added inline comments. Feb 8 2021, 2:49 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2870	Maybe better to use `Builder.getInt32(-Part)` to be consistent with how it's done for the fixed width case?
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
15	If you use an i64 variable here then it makes the tests simpler as the variable doesn't need sign-extending with `sext`
22	Hi @CarolineConcatto something doesn't look right here because the result will be zero.
51	See comment about `N` - this line can be killed if you have "i64 N", then everywhere that uses %0 can just use %N instead. If you do this then the `for.body.preheader` block can be killed too and the branch above can jump straight to `for.body`
71	Same comment as in previous function.

CarolineConcatto edited the summary of this revision. Feb 9 2021, 9:05 AM

CarolineConcatto edited the summary of this revision.

CarolineConcatto edited the summary of this revision. Feb 9 2021, 9:08 AM

david-arm added inline comments. Feb 10 2021, 6:02 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
22	I think I may have made a mistake here and the '0' is there because this is the 0th Part. Looking at the way GEPs are created it doesn't really optimise the case when Part=0. Adding "-instcombine" to the RUN line would clear up any dead code here.

change how to get NumElt inside LoopVectorize
change the index to be 64 bits

Harbormaster completed remote builds in B88690: Diff 322781. Feb 10 2021, 12:08 PM

CarolineConcatto marked an inline comment as done. Feb 10 2021, 12:16 PM

CarolineConcatto added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2870	Thank you David, you are correct things should be coherent.
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
51	@david-arm I've replaced the index to be 64 bits, but I could not remove the for.body.preheader.

-removing redundant lines in the test

Harbormaster completed remote builds in B88784: Diff 322951. Feb 11 2021, 4:55 AM

Change index 'i' in the tests to be 64 bits instead of 32 bits

Harbormaster completed remote builds in B88790: Diff 322962. Feb 11 2021, 5:37 AM

Thank you @david-arm.
I've removed the pre.header.
It is indeed better.

Hi @CarolineConcatto, it's looking much better now and thanks for dealing with all the review comments!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2869	Hi @CarolineConcatto, I just realised there isn't a test for this for scalable vectors. Would you be able to add a test for blocks with masks too? I think it should be something like:
  void foo(int * __restrict__ a, int * __restrict__ cond, long long N) {
    for (int i = N - 1; i >= 0; i--) {
      if (cond[i])
        a[i] += 1;
    }
  }
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll
23	I think that since you're adding tests for this it's probably a good idea to expand all the shufflevector CHECK lines in this file to make sure we have a reverse mask here too?
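
The masked case can be modelled with a short sketch: when the wide store is reversed into memory order, the block mask has to be reversed along with the data, otherwise the predicate of one lane would guard a different element. This is a hedged model in plain C++, not vectorizer code; `maskedReverseScalar`, `maskedReverseVectorModel` and `VF` are hypothetical names, and the scalar loop follows the example suggested in the comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference, following the loop suggested in the review comment.
void maskedReverseScalar(std::vector<int> &a, const std::vector<int> &cond) {
  for (long long i = static_cast<long long>(a.size()) - 1; i >= 0; i--)
    if (cond[static_cast<std::size_t>(i)])
      a[static_cast<std::size_t>(i)] += 1;
}

// Hypothetical fixed-width model: values and mask are produced in iteration
// order (lane 0 is the highest index), then both are reversed so that the
// masked store sees them in memory order.
void maskedReverseVectorModel(std::vector<int> &a,
                              const std::vector<int> &cond, std::size_t VF) {
  assert(VF > 0 && a.size() == cond.size());
  std::size_t i = a.size(); // one past the highest unprocessed index
  while (i >= VF) {
    std::size_t Base = i - VF; // lowest address touched by this wide access
    std::vector<int> Val(VF);
    std::vector<char> Mask(VF);
    for (std::size_t L = 0; L < VF; ++L) {
      Mask[L] = cond[i - 1 - L] != 0;
      Val[L] = a[i - 1 - L] + 1;
    }
    // Reverse data and mask together before the masked store.
    std::reverse(Val.begin(), Val.end());
    std::reverse(Mask.begin(), Mask.end());
    for (std::size_t M = 0; M < VF; ++M)
      if (Mask[M])
        a[Base + M] = Val[M];
    i -= VF;
  }
  // Scalar remainder for the lowest indices.
  while (i > 0) {
    --i;
    if (cond[i])
      a[i] += 1;
  }
}
```

Dropping the `std::reverse` on `Mask` makes the model disagree with the scalar loop for any non-palindromic predicate pattern, which is the kind of mistake a masked test is meant to catch.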

Matt added a subscriber: Matt. Feb 16 2021, 9:05 AM

bin.cheng-ali added a subscriber: bin.cheng-ali. Feb 19 2021, 8:55 AM

bin.cheng-ali added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2882	Sorry for one nitpick. IIUC the difference between the scalable and fixed cases is the two operands of the GEPs; would it be better to factor out the common code as below?
  if (VF.isScalable()) {
    // build GEP operands for scalable case
  } else {
    // build GEP operands for fixed case
  }
  // common code building VecPtr with above GEP operands.

david-arm added inline comments. Feb 22 2021, 3:13 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2882	I guess alternatively we could commonise them into one block where we always calculate
  Value *LastLane = Builder.CreateSub(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
for both fixed-width and scalable vectors, and just let instcombine and other passes fold the constants? I'm not sure if that breaks lots of tests though ...

HLJ2009 added a subscriber: HLJ2009. Feb 23 2021, 7:55 PM

CarolineConcatto mentioned this in D97758: [CostModel][SVE] Add cost model for shuffle reverse with i1 and scalable vector. Mar 2 2021, 1:36 AM

CarolineConcatto mentioned this in rGf2b749be1568: [CostModel][SVE] Add cost model for shuffle reverse with i1 and scalable vector. Mar 4 2021, 10:54 AM

Add tests for when vector.reverse needs to reverse a mask
Improve code in LoopVectorize.cpp
Copy function getRuntimeVF from D95139

Harbormaster completed remote builds in B92254: Diff 328434. Mar 5 2021, 1:23 AM

Thank you all for the review.
I copied the function getRuntimeVF from D95139.
If D95139 is merged before this patch, I will remove the copy.
All the other dependencies are already in main, and getRuntimeVF is the only function that this patch needs from D95139.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2869	Thank you @david-arm for pointing that out. Yes, a test for that was indeed missing. The cost model for vector reverse with i1 was missing too.
2882	So @david-arm's suggestion works for both fixed-width and scalable vectors. At least no test has failed. I could also have done it the way @bin.cheng-ali suggested:
  if (scalable) {
    NumElts = Builder.CreateMul(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
    LastLane = Builder.CreateSub(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
  } else {
    NumElts = 1 * VF.getKnownMinValue();
    LastLane = 1 - VF.getKnownMinValue();
  }
But as everything works with getRuntimeVF, I thought that would be code duplication, and this way the code looks simpler. I added a comment explaining how it works for fixed-width vectors, for my own future reference too.
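
The arithmetic behind the single code path can be checked with a small model. This is only a sketch under stated assumptions: `ElemCount`, `runtimeVF` and `reversedAccessOffset` are hypothetical stand-ins (not LLVM types or functions), `VScale` is passed in as an ordinary integer, and the offsets follow the comments in the patch (NumElt = -Part * RunTimeVF, LastLane = 1 - RunTimeVF).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an element count: a known minimum number of
// lanes plus a flag saying whether it is multiplied by vscale at runtime.
struct ElemCount {
  std::int32_t KnownMin;
  bool Scalable;
};

// Models the runtime VF: VScale * KnownMin for scalable vectors, and just
// KnownMin for fixed-width vectors (VScale is effectively 1 there).
std::int32_t runtimeVF(ElemCount VF, std::int32_t VScale) {
  return VF.Scalable ? VScale * VF.KnownMin : VF.KnownMin;
}

// Element offset from the scalar pointer to the start of the reversed wide
// access for unroll part Part, i.e. the sum of the two GEP indices.
std::int32_t reversedAccessOffset(std::int32_t Part, ElemCount VF,
                                  std::int32_t VScale) {
  assert(VF.KnownMin > 0 && VScale > 0);
  std::int32_t RunTimeVF = runtimeVF(VF, VScale);
  std::int32_t NumElt = -Part * RunTimeVF; // first GEP: skip earlier parts
  std::int32_t LastLane = 1 - RunTimeVF;   // second GEP: back to lowest lane
  return NumElt + LastLane;
}
```

For a fixed width of 4 this gives offsets -3 and -7 for parts 0 and 1, the same values the old constant expressions `-Part * VF` and `1 - VF` produce, which matches the observation that no fixed-width test changed behaviour.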

CarolineConcatto added inline comments. Mar 5 2021, 1:42 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll
15	Hi @david-arm, I think I forgot to answer that suggestion. This file has 2 tests, for 2 variable types, double and integer:
  @vector_reverse_f64(i64 %N, double* %a, double* %b)
  @vector_reverse_i64(i64 %N, i64* %a, i64* %b)
So I believe it is ok to leave this as it is, unless I am missing something (which is completely ok for me). I can see that for @vector_reverse_i64 the sext instruction is gone.

david-arm added inline comments. Mar 5 2021, 3:32 AM

llvm/include/llvm/IR/IRBuilder.h
2447	nit: This function looks a bit too complex to live in a header - maybe better defined in the .cpp file?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1110	nit: One of my patches has already landed with this function so I think you'll need to remove it when committing.
2860	nit: Maybe instead of calling getRuntimeVF twice you can call it once and assign it to a variable?
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
25	nit: Perhaps better renamed to `nxv4i1` since that's what the generated code looks like?
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll
25	nit: Perhaps better renamed to `v4i1` since the CHECK lines have that type?
28	This seems a bit less well tested than the sve version - perhaps it's worth adding CHECK lines for the store too? I assume the store will reuse the same mask?
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll
30	nit: %[[REVERSE6]]?
65	nit: %[[REVERSE6]]

Fix nits in the .ll tests and LoopVectorize.cpp
Remove copied getRuntimeVF from LoopVectorize.cpp
Move CreateVectorReverse implementation from IRBuilder.h to IRBuilder.cpp

Hey @david-arm sorry for the nit in the tests.
Usually happens when I do copy and paste many times.
I've updated the patch and rebased it with the main.

Carol

llvm/include/llvm/IR/IRBuilder.h
2447	Thank you David. I hadn't noticed that this could be also implemented in IRBuilder.cpp.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1110	Wonderful! Good to know!
2860	Good idea David, so I can reduce code duplication
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll
25	That is what happens when I do C&P. I've created tests for all mask sizes, but only submitted one size. I was wondering whether that is valid, since I did not add tests for all mask sizes here. I guess that one size fits all: if one mask size does not break, the rest should not break either.
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll
28	Ok, I think it is valid to have similar checks for scalable and fixed vectors
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll
30	Good catch!

CarolineConcatto marked 5 inline comments as done. Mar 8 2021, 5:54 AM

Harbormaster completed remote builds in B92640: Diff 328984. Mar 8 2021, 6:57 AM

LGTM!

llvm/lib/IR/IRBuilder.cpp
996	nit: I think in this case it's worth fixing the clang-tidy comment. It looks like the function above also uses "auto *..."

This revision is now accepted and ready to land. Mar 9 2021, 3:10 AM

david-arm mentioned this in D98506: [SVE][LoopVectorize] Verify support for vectorizing loops with invariant loads. Mar 15 2021, 9:25 AM

This revision was landed with ongoing or failed builds. Mar 16 2021, 2:19 AM

Closed by commit rG3c03635d5300: [SVE][LoopVectorize] Add support for scalable vectorization of loops with… (authored by CarolineConcatto).

This revision was automatically updated to reflect the committed changes.

CarolineConcatto added a commit: rG3c03635d5300: [SVE][LoopVectorize] Add support for scalable vectorization of loops with….

Revision Contents

Path	Size
llvm/include/llvm/IR/IRBuilder.h	3 lines
llvm/lib/IR/IRBuilder.cpp	16 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	26 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll	68 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll	108 lines
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll	65 lines
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll	91 lines

Diff 330905

llvm/include/llvm/IR/IRBuilder.h

[... 2,438 lines not shown ...]	public:

   /// Create a unary shuffle. The second vector operand of the IR instruction
   /// is poison.
   Value *CreateShuffleVector(Value *V, ArrayRef<int> Mask,
                              const Twine &Name = "") {
     return CreateShuffleVector(V, PoisonValue::get(V->getType()), Mask, Name);
   }

   Value *CreateExtractValue(Value *Agg,
                             ArrayRef<unsigned> Idxs,
                             const Twine &Name = "") {
     if (auto *AggC = dyn_cast<Constant>(Agg))
       return Insert(Folder.CreateExtractValue(AggC, Idxs), Name);
     return Insert(ExtractValueInst::Create(Agg, Idxs), Name);
   }

   Value *CreateInsertValue(Value *Agg, Value *Val,
                            ArrayRef<unsigned> Idxs,
                            const Twine &Name = "") {
     if (auto *AggC = dyn_cast<Constant>(Agg))
       if (auto *ValC = dyn_cast<Constant>(Val))
         return Insert(Folder.CreateInsertValue(AggC, ValC, Idxs), Name);
[... 38 lines not shown ...]	public:
   /// address space before call and casted back to Ptr type after call.
   Value *CreateLaunderInvariantGroup(Value *Ptr);

   /// \brief Create a strip.invariant.group intrinsic call. If Ptr type is
   /// different from pointer to i8, it's casted to pointer to i8 in the same
   /// address space before call and casted back to Ptr type after call.
   Value *CreateStripInvariantGroup(Value *Ptr);

+  /// Return a vector value that contains the vector V reversed
+  Value *CreateVectorReverse(Value *V, const Twine &Name = "");
+
   /// Return a vector value that contains \arg V broadcasted to \p
   /// NumElts elements.
   Value *CreateVectorSplat(unsigned NumElts, Value *V, const Twine &Name = "");

   /// Return a vector value that contains \arg V broadcasted to \p
   /// EC elements.
   Value *CreateVectorSplat(ElementCount EC, Value *V, const Twine &Name = "");

[... 132 lines not shown ...]

llvm/lib/IR/IRBuilder.cpp

[... 986 lines not shown ...]	Value *IRBuilderBase::CreateStripInvariantGroup(Value *Ptr) {

   CallInst *Fn = CreateCall(FnStripInvariantGroup, {Ptr});

   if (PtrType != Int8PtrTy)
     return CreateBitCast(Fn, PtrType);
   return Fn;
 }

+Value *IRBuilderBase::CreateVectorReverse(Value *V, const Twine &Name) {
+  auto *Ty = cast<VectorType>(V->getType());
+  if (isa<ScalableVectorType>(Ty)) {
+    Module *M = BB->getParent()->getParent();
+    Function *F = Intrinsic::getDeclaration(
+        M, Intrinsic::experimental_vector_reverse, Ty);
+    return Insert(CallInst::Create(F, V), Name);
+  }
+  // Keep the original behaviour for fixed vector
+  SmallVector<int, 8> ShuffleMask;
+  int NumElts = Ty->getElementCount().getKnownMinValue();
+  for (int i = 0; i < NumElts; ++i)
+    ShuffleMask.push_back(NumElts - i - 1);
+  return CreateShuffleVector(V, ShuffleMask, Name);
+}
+
 Value *IRBuilderBase::CreateVectorSplat(unsigned NumElts, Value *V,
                                         const Twine &Name) {
   auto EC = ElementCount::getFixed(NumElts);
   return CreateVectorSplat(EC, V, Name);
 }

 Value *IRBuilderBase::CreateVectorSplat(ElementCount EC, Value *V,
                                         const Twine &Name) {
[... 147 lines not shown ...]

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

[... 1,101 lines not shown ...]
 static Value *createStepForVF(IRBuilder<> &B, Constant *Step, ElementCount VF) {
   assert(isa<ConstantInt>(Step) && "Expected an integer step");
   Constant *StepVal = ConstantInt::get(
       Step->getType(),
       cast<ConstantInt>(Step)->getSExtValue() * VF.getKnownMinValue());
   return VF.isScalable() ? B.CreateVScale(StepVal) : StepVal;
 }

 namespace llvm {

 /// Return the runtime value for VF.
 Value *getRuntimeVF(IRBuilder<> &B, Type *Ty, ElementCount VF) {
   Constant *EC = ConstantInt::get(Ty, VF.getKnownMinValue());
   return VF.isScalable() ? B.CreateVScale(EC) : EC;
 }

 void reportVectorizationFailure(const StringRef DebugMsg,
[... 1,444 lines not shown ...]	void InnerLoopVectorizer::packScalarIntoVectorValue(VPValue *Def,
   VectorValue = Builder.CreateInsertElement(
       VectorValue, ScalarInst,
       Instance.Lane.getAsRuntimeExpr(State.Builder, VF));
   State.set(Def, VectorValue, Instance.Part);
 }

 Value *InnerLoopVectorizer::reverseVector(Value *Vec) {
   assert(Vec->getType()->isVectorTy() && "Invalid type");
-  assert(!VF.isScalable() && "Cannot reverse scalable vectors");
-  SmallVector<int, 8> ShuffleMask;
-  for (unsigned i = 0; i < VF.getKnownMinValue(); ++i)
-    ShuffleMask.push_back(VF.getKnownMinValue() - i - 1);
-
-  return Builder.CreateShuffleVector(Vec, ShuffleMask, "reverse");
+  return Builder.CreateVectorReverse(Vec, "reverse");
 }

 // Return whether we allow using masked interleave-groups (for dealing with
 // strided loads/stores that reside in predicated blocks, or for dealing
 // with gaps).
 static bool useMaskedInterleavedAccesses(const TargetTransformInfo &TTI) {
   // If an override option has been passed in for interleaved accesses, use it.
   if (EnableMaskedInterleavedMemAccesses.getNumOccurrences() > 0)
▲ Show 20 Lines • Show All 264 Lines • ▼ Show 20 Lines	void InnerLoopVectorizer::vectorizeMemoryInstruction(

const auto CreateVecPtr = [&](unsigned Part, Value Ptr) -> Value {		const auto CreateVecPtr = [&](unsigned Part, Value Ptr) -> Value {
// Calculate the pointer for the specific unroll-part.		// Calculate the pointer for the specific unroll-part.
GetElementPtrInst *PartPtr = nullptr;		GetElementPtrInst *PartPtr = nullptr;

bool InBounds = false;		bool InBounds = false;
if (auto *gep = dyn_cast<GetElementPtrInst>(Ptr->stripPointerCasts()))		if (auto *gep = dyn_cast<GetElementPtrInst>(Ptr->stripPointerCasts()))
InBounds = gep->isInBounds();		InBounds = gep->isInBounds();

if (Reverse) {		if (Reverse) {
assert(!VF.isScalable() &&
"Reversing vectors is not yet supported for scalable vectors.");

// If the address is consecutive but reversed, then the		// If the address is consecutive but reversed, then the
		sdesmalenUnsubmitted Done Reply Inline Actions `NumElts` can't be `unsigned` here, because it has to be a value that's evaluated at runtime. This means it needs to multiply `VF.getKnownMinValue` by the runtime value of VScale, making `NumElts` a `Value `. For the first GEP you can also do the arithmetic on a pointer of the vector type, and index by `-Part`, e.g. getelementptr <vscale x 8 x double>, <vscale x 8 x double> %ptr, i64 -<part> <=> %vscale = call i64 @llvm.vscale() %part = mul i64 %vscale, 8 getelementptr double, double %ptr, i64 %part The second GEP indexes a specific element, so that would still need to be a getelementptr on a `double`. sdesmalen: `NumElts` can't be `unsigned` here, because it has to be a value that's evaluated at runtime.
		david-armUnsubmitted Done Reply Inline Actions Hi Carol. I don't think this is right here. The number of cached lanes in my patch (D95139) simply refers to the number of values we have managed to cache. It's just an optimisation the vectoriser uses to avoid constantly calculating the same lane index expressions. However, what you're after here is a runtime calculation for the last lane index - you can use getRuntimeVF in my patch for this. I think the code below will probably need splitting out into fixed width vs scalable cases, i.e. Builder.getInt32(1 - NumElts) would need to be something like: Builder.CreateSub(Builder.getInt32(1), getRuntimeVF()) for scalable vectors. Hope that makes sense! david-arm: Hi Carol. I don't think this is right here. The number of cached lanes in my patch (D95139)…
// wide store needs to start at the last vector element.		// wide store needs to start at the last vector element.
PartPtr = cast<GetElementPtrInst>(Builder.CreateGEP(		// RunTimeVF = VScale * VF.getKnownMinValue()
ScalarDataTy, Ptr, Builder.getInt32(-Part * VF.getKnownMinValue())));		// For fixed-width VScale is 1, then RunTimeVF = VF.getKnownMinValue()
		Value *RunTimeVF = getRuntimeVF(Builder, Builder.getInt32Ty(), VF);
		// NumElt = -Part * RunTimeVF
		Value *NumElt = Builder.CreateMul(Builder.getInt32(-Part), RunTimeVF);
		// LastLane = 1 - RunTimeVF
		david-armUnsubmitted Not Done Reply Inline Actions nit: Maybe instead of calling getRuntimeVF twice you can call it once and assign it to a variable? david-arm: nit: Maybe instead of calling getRuntimeVF twice you can call it once and assign it to a…
		CarolineConcattoAuthorUnsubmitted Done Reply Inline Actions Good idea David, so I can reduce code duplication CarolineConcatto: Good idea David, so I can reduce code duplication
		Value *LastLane = Builder.CreateSub(Builder.getInt32(1), RunTimeVF);
		PartPtr =
		cast<GetElementPtrInst>(Builder.CreateGEP(ScalarDataTy, Ptr, NumElt));
PartPtr->setIsInBounds(InBounds);		PartPtr->setIsInBounds(InBounds);
PartPtr = cast<GetElementPtrInst>(Builder.CreateGEP(		PartPtr = cast<GetElementPtrInst>(
ScalarDataTy, PartPtr, Builder.getInt32(1 - VF.getKnownMinValue())));		Builder.CreateGEP(ScalarDataTy, PartPtr, LastLane));
PartPtr->setIsInBounds(InBounds);		PartPtr->setIsInBounds(InBounds);
if (isMaskRequired) // Reverse of a null all-one mask is a null mask.		if (isMaskRequired) // Reverse of a null all-one mask is a null mask.
BlockInMaskParts[Part] = reverseVector(BlockInMaskParts[Part]);		BlockInMaskParts[Part] = reverseVector(BlockInMaskParts[Part]);
		david-arm [Not Done]: Hi @CarolineConcatto, I just realised there isn't a test for this for scalable vectors. Would you be able to add a test for blocks with masks too? I think it should be something like:
		  void foo(int * __restrict__ a, int * __restrict__ cond, long long N) {
		    for (int i = N - 1; i >= 0; i--) {
		      if (cond[i])
		        a[i] += 1;
		    }
		  }
		CarolineConcatto (Author) [Done]: Thank you @david-arm for pointing that out. Yes, a test for that was indeed missing. The cost model for vector reverse of i1 was missing as well.
		} else {
		david-arm [Done]: Maybe better to use `Builder.getInt32(-Part)` to be consistent with how it's done for the fixed width case?
		CarolineConcatto (Author) [Done]: Thank you David, you are correct, things should be consistent.
		Value *Increment = createStepForVF(Builder, Builder.getInt32(Part), VF);
		PartPtr = cast<GetElementPtrInst>(
		Builder.CreateGEP(ScalarDataTy, Ptr, Increment));
		PartPtr->setIsInBounds(InBounds);
		}

		unsigned AddressSpace = Ptr->getType()->getPointerAddressSpace();
		return Builder.CreateBitCast(PartPtr, DataTy->getPointerTo(AddressSpace));
		};

		// Handle Stores:
		if (SI) {
		bin.cheng-ali [Not Done]: Sorry for one nitpick. IIUC the difference between the scalable and fixed cases is the two operands of the GEPs. Would it be better to factor out the common code as below?
		  if (VF.isScalable()) {
		    // build GEP operands for scalable case
		  } else {
		    // build GEP operands for fixed case
		  }
		  // common code building VecPtr with above GEP operands.
		david-arm [Not Done]: I guess alternatively we could commonise them into one block where we always calculate
		  Value *LastLane = Builder.CreateSub(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
		for both fixed-width and scalable vectors, and just let instcombine and other passes fold the constants? I'm not sure if that breaks lots of tests though ...
		CarolineConcatto (Author) [Done]: So @david-arm's suggestion works for fixed-width and scalable vectors. At least no test has failed. I could also have done it the way @bin.cheng-ali suggested:
		  if (scalable) {
		    NumElts = Builder.CreateMul(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
		    LastLane = Builder.CreateSub(Builder.getInt32(1), getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
		  } else {
		    NumElts = 1 * VF.getKnownMinValue()
		    LastLane = 1 - VF.getKnownMinValue()
		  }
		But since both cases work with getRuntimeVF, I thought that was code duplication, and this way the code looks simpler. I also added a comment explaining how it works for fixed-width vectors, for my own future reference.
		setDebugLocFromInst(Builder, SI);

		for (unsigned Part = 0; Part < UF; ++Part) {
		Instruction *NewSI = nullptr;
		Value *StoredVal = State.get(StoredValue, Part);
		if (CreateGatherScatter) {
		Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
		Value *VectorGep = State.get(Addr, Part);
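
The address arithmetic discussed in the inline comments above can be modelled with plain integers. The following is a minimal sketch, not the LoopVectorize code: `ReverseOffsets`, `reverseOffsets`, `firstElement`, `knownMinVF` and `vscale` are hypothetical names, and it assumes that `NumElt` is computed as `-Part * RunTimeVF`, which the excerpt does not show. `LastLane = 1 - RunTimeVF` is taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two GEP offsets (in elements) used for a reversed
// access. RunTimeVF = vscale * knownMinVF; for fixed-width vectors vscale is 1.
struct ReverseOffsets {
  int64_t numElt;   // selects the unroll part (ASSUMED to be -Part * RunTimeVF)
  int64_t lastLane; // steps back to the lowest-addressed lane: 1 - RunTimeVF
};

ReverseOffsets reverseOffsets(int64_t part, int64_t knownMinVF, int64_t vscale) {
  assert(part >= 0 && knownMinVF > 0 && vscale > 0);
  const int64_t runtimeVF = vscale * knownMinVF;
  return {-part * runtimeVF, 1 - runtimeVF};
}

// Index of the first element touched by the wide load/store for a given part,
// when the scalar pointer addresses element `idx`.
int64_t firstElement(int64_t idx, int64_t part, int64_t knownMinVF,
                     int64_t vscale) {
  const ReverseOffsets o = reverseOffsets(part, knownMinVF, vscale);
  return idx + o.numElt + o.lastLane;
}
```

With `knownMinVF = 8` and `vscale = 1`, part 0 gives offsets 0 and -7, which matches the `i32 0` and `i32 -7` GEP indices checked in vector-reverse.ll. For scalable vectors the same formula is evaluated at run time, which is why a single code path can serve both cases.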

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll

This file was added.

				; This is the loop in C++ being vectorized in this file with
				; experimental.vector.reverse

				;#pragma clang loop vectorize_width(4, scalable)
				; for (long int i = N - 1; i >= 0; i--)
				; {
				; if (cond[i])
				; a[i] += 1;
				; }

				; The test checks if the mask is being correctly created, reversed and used

				; RUN: opt -loop-vectorize -dce -instcombine -mtriple aarch64-linux-gnu -S < %s 2>%t | FileCheck %s

				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning


				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown-linux-gnu"

				define void @vector_reverse_mask_nxv4i1(double* %a, double* %cond, i64 %N) #0 {
				; CHECK-LABEL: vector.body:
				david-arm [Not Done]: nit: Perhaps better renamed to `nxv4i1` since that's what the generated code looks like?
				CarolineConcatto (Author) [Done]: That is what happens when I copy and paste. I've created tests for all mask sizes, but only submitted one size here. Do you think that is valid? I guess one size fits all: if one mask size does not break, the rest should not break either.
				; CHECK: %[[REVERSE6:.*]] = call <vscale x 4 x i1> @llvm.experimental.vector.reverse.nxv4i1(<vscale x 4 x i1> %{{.*}})
				; CHECK: %[[WIDEMSKLOAD:.*]] = call <vscale x 4 x double> @llvm.masked.load.nxv4f64.p0nxv4f64(<vscale x 4 x double>* nonnull %{{.*}}, i32 8, <vscale x 4 x i1> %[[REVERSE6]], <vscale x 4 x double> poison)
				; CHECK-NEXT: %[[REVERSE7:.*]] = call <vscale x 4 x double> @llvm.experimental.vector.reverse.nxv4f64(<vscale x 4 x double> %[[WIDEMSKLOAD]])
				; CHECK-NEXT: %[[FADD:.*]] = fadd <vscale x 4 x double> %[[REVERSE7]]
				; CHECK-NEXT: %[[REVERSE8:.*]] = call <vscale x 4 x double> @llvm.experimental.vector.reverse.nxv4f64(<vscale x 4 x double> %[[FADD]])
				; CHECK: %[[REVERSE9:.*]] = call <vscale x 4 x i1> @llvm.experimental.vector.reverse.nxv4i1(<vscale x 4 x i1> %{{.*}})
				; CHECK: call void @llvm.masked.store.nxv4f64.p0nxv4f64(<vscale x 4 x double> %[[REVERSE8]], <vscale x 4 x double>* %{{.*}}, i32 8, <vscale x 4 x i1> %[[REVERSE9]]

				entry:
				%cmp7 = icmp sgt i64 %N, 0
				br i1 %cmp7, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup, %entry
				ret void

				for.body: ; preds = %for.body, %entry
				%i.08.in = phi i64 [ %i.08, %for.inc ], [ %N, %entry ]
				%i.08 = add nsw i64 %i.08.in, -1
				%arrayidx = getelementptr inbounds double, double* %cond, i64 %i.08
				%0 = load double, double* %arrayidx, align 8
				%tobool = fcmp une double %0, 0.000000e+00
				br i1 %tobool, label %if.then, label %for.inc

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds double, double* %a, i64 %i.08
				%1 = load double, double* %arrayidx1, align 8
				%add = fadd double %1, 1.000000e+00
				store double %add, double* %arrayidx1, align 8
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%cmp = icmp sgt i64 %i.08.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}

				attributes #0 = {"target-cpu"="generic" "target-features"="+neon,+sve"}


				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 4}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}
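
The sequence the CHECK lines above look for (reverse the mask, masked load, reverse, fadd, reverse, reverse the mask, masked store) can be emulated in plain C++ to see why it is equivalent to the scalar loop. This is only an illustrative sketch: `scalarReverseMasked`, `blockReverseMasked` and `vf` are hypothetical names, the vector width is a run-time parameter standing in for `vscale x 4`, and the array length is assumed to be a multiple of `vf` so that no scalar remainder loop is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: the loop shown in the test header comment.
void scalarReverseMasked(std::vector<double>& a, const std::vector<double>& cond) {
  assert(a.size() == cond.size());
  for (std::size_t i = a.size(); i-- > 0;)
    if (cond[i] != 0.0) a[i] += 1.0;
}

// Block-wise emulation of the vectorized loop body for a vector of vf lanes.
void blockReverseMasked(std::vector<double>& a, const std::vector<double>& cond,
                        std::size_t vf) {
  assert(vf > 0 && a.size() == cond.size() && a.size() % vf == 0);
  for (std::size_t hi = a.size(); hi > 0; hi -= vf) {
    const std::size_t lo = hi - vf;  // lowest index touched by this block

    // Mask in iteration order: lane k belongs to scalar iteration hi - 1 - k.
    std::vector<bool> mask(vf);
    for (std::size_t k = 0; k < vf; ++k) mask[k] = cond[hi - 1 - k] != 0.0;

    // Reverse the mask so that it lines up with memory order.
    std::vector<bool> memMask(mask.rbegin(), mask.rend());

    // Masked load in memory order; inactive lanes keep a placeholder value.
    std::vector<double> v(vf, 0.0);
    for (std::size_t k = 0; k < vf; ++k)
      if (memMask[k]) v[k] = a[lo + k];

    // Reverse to iteration order, add 1.0, reverse back to memory order.
    std::reverse(v.begin(), v.end());
    for (double& x : v) x += 1.0;
    std::reverse(v.begin(), v.end());

    // Masked store in memory order, using the reversed mask again.
    for (std::size_t k = 0; k < vf; ++k)
      if (memMask[k]) a[lo + k] = v[k];
  }
}
```

Running both functions on copies of the same input should leave the arrays identical. The element-wise add does not depend on lane order, so the two data reverses cancel out here, but the mask must still be reversed to match the memory layout of the wide load and store.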

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll

This file was added.

				; This is the loop in C++ being vectorized in this file with
				; experimental.vector.reverse
				; #pragma clang loop vectorize_width(8, scalable)
				; for (int i = N-1; i >= 0; --i)
				; a[i] = b[i] + 1.0;

				; RUN: opt -loop-vectorize -dce -instcombine -mtriple aarch64-linux-gnu -S < %s 2>%t | FileCheck %s

				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning

				define void @vector_reverse_f64(i64 %N, double* %a, double* %b) #0{
				; CHECK-LABEL: @vector_reverse_f64
				david-arm [Not Done]: If you use an i64 variable here then it makes the tests simpler, as the variable doesn't need sign-extending with `sext`.
				CarolineConcatto (Author) [Done]: Hi @david-arm, I think I forgot to answer that suggestion. This file has 2 tests, one per element type (double and integer):
				  @vector_reverse_f64(i64 %N, double* %a, double* %b)
				  @vector_reverse_i64(i64 %N, i64* %a, i64* %b)
				So I believe it is OK to leave this as it is, unless I am missing something (which is completely OK for me). I can see that for @vector_reverse_i64 the sext instruction is gone.
				; CHECK-LABEL: vector.body:
				; CHECK: %[[ADD:.*]] = add i64 %{{.*}}, %N
				; CHECK-NEXT: %[[GEP:.*]] = getelementptr inbounds double, double* %b, i64 %[[ADD]]
				; CHECK-NEXT: %[[VSCALE:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[MUL:.*]] = mul i32 %[[VSCALE]], -8
				; CHECK-NEXT: %[[OR:.*]] = or i32 %[[MUL]], 1
				; CHECK-NEXT: %[[SEXT:.*]] = sext i32 %[[OR]] to i64
				david-arm [Not Done]: Hi @CarolineConcatto, something doesn't look right here because the result will be zero.
				david-arm [Not Done]: I think I may have made a mistake here and the '0' is there because this is the 0th Part. Looking at the way the GEPs are created, the code doesn't really optimise the case when Part=0. Adding "-instcombine" to the RUN line would clear up any dead code here.
				; CHECK-NEXT: %[[GEP1:.*]] = getelementptr inbounds double, double* %[[GEP]], i64 %[[SEXT]]
				; CHECK-NEXT: %[[CAST:.*]] = bitcast double* %[[GEP1]] to <vscale x 8 x double>*
				sdesmalen [Not Done]: It is difficult to verify what this code is generating, and the GEPs here are actually really important to get right, so can you just check for explicit values?
				CarolineConcatto (Author) [Done]: I don't know if I understood what you want. I had to change the checks. Would you mind showing what you would like to see checked in these tests?
				; CHECK-NEXT: %[[WIDE:.*]] = load <vscale x 8 x double>, <vscale x 8 x double>* %[[CAST]], align 8
				; CHECK-NEXT: %[[REVERSE:.*]] = call <vscale x 8 x double> @llvm.experimental.vector.reverse.nxv8f64(<vscale x 8 x double> %[[WIDE]])
				; CHECK-NEXT: %[[FADD:.*]] = fadd <vscale x 8 x double> %[[REVERSE]], shufflevector
				; CHECK-NEXT: %[[GEP2:.*]] = getelementptr inbounds double, double* %a, i64 %[[ADD]]
				; CHECK-NEXT: %[[REVERSE6:.*]] = call <vscale x 8 x double> @llvm.experimental.vector.reverse.nxv8f64(<vscale x 8 x double> %[[FADD]])
				; CHECK-NEXT: %[[VSCALE1:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[MUL1:.*]] = mul i32 %[[VSCALE1]], -8
				; CHECK-NEXT: %[[OR1:.*]] = or i32 %[[MUL1]], 1
				; CHECK-NEXT: %[[SEXT1:.*]] = sext i32 %[[OR1]] to i64
				; CHECK-NEXT: %[[GEP3:.*]] = getelementptr inbounds double, double* %[[GEP2]], i64 %[[SEXT1]]
				; CHECK-NEXT: %[[CAST1:.*]] = bitcast double* %[[GEP3]] to <vscale x 8 x double>*
				; CHECK-NEXT: store <vscale x 8 x double> %[[REVERSE6]], <vscale x 8 x double>* %[[CAST1]], align 8

				entry:
				%cmp7 = icmp sgt i64 %N, 0
				br i1 %cmp7, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%i.08.in = phi i64 [ %i.08, %for.body ], [ %N, %entry ]
				%i.08 = add nsw i64 %i.08.in, -1
				%arrayidx = getelementptr inbounds double, double* %b, i64 %i.08
				%0 = load double, double* %arrayidx, align 8
				%add = fadd double %0, 1.000000e+00
				%arrayidx1 = getelementptr inbounds double, double* %a, i64 %i.08
				david-arm [Not Done]: See comment about `N` - this line can be killed if you have "i64 N", then everywhere that uses %0 can just use %N instead. If you do this then the `for.body.preheader` block can be killed too and the branch above can jump straight to `for.body`.
				CarolineConcatto (Author) [Done]: @david-arm I've changed the index to be 64 bits, but I could not remove the for.body.preheader.
				store double %add, double* %arrayidx1, align 8
				%cmp = icmp sgt i64 %i.08.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}


				define void @vector_reverse_i64(i64 %N, i64* %a, i64* %b) #0 {
				; CHECK-LABEL: vector_reverse_i64
				; CHECK-LABEL: vector.body:
				; CHECK: %[[ADD:.*]] = add i64 %{{.*}}, %N
				; CHECK-NEXT: %[[GEP:.*]] = getelementptr inbounds i64, i64* %b, i64 %[[ADD]]
				; CHECK-NEXT: %[[VSCALE:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[MUL:.*]] = mul i32 %[[VSCALE]], -8
				; CHECK-NEXT: %[[OR:.*]] = or i32 %[[MUL]], 1
				; CHECK-NEXT: %[[SEXT:.*]] = sext i32 %[[OR]] to i64
				; CHECK-NEXT: %[[GEP1:.*]] = getelementptr inbounds i64, i64* %[[GEP]], i64 %[[SEXT]]
				; CHECK-NEXT: %[[CAST:.*]] = bitcast i64* %[[GEP1]] to <vscale x 8 x i64>*
				; CHECK-NEXT: %[[WIDE:.*]] = load <vscale x 8 x i64>, <vscale x 8 x i64>* %[[CAST]], align 8
				; CHECK-NEXT: %[[REVERSE:.*]] = call <vscale x 8 x i64> @llvm.experimental.vector.reverse.nxv8i64(<vscale x 8 x i64> %[[WIDE]])
				; CHECK-NEXT: %[[ADD1:.*]] = add <vscale x 8 x i64> %[[REVERSE]]
				david-arm [Done]: Same comment as in previous function.
				; CHECK-NEXT: %[[GEP2:.*]] = getelementptr inbounds i64, i64* %a, i64 %[[ADD]]
				; CHECK-NEXT: %[[REVERSE6]] = call <vscale x 8 x i64> @llvm.experimental.vector.reverse.nxv8i64(<vscale x 8 x i64> %[[ADD1]])
				; CHECK-NEXT: %[[VSCALE:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[MUL1:.*]] = mul i32 %[[VSCALE]], -8
				; CHECK-NEXT: %[[OR1:.*]] = or i32 %[[MUL1]], 1
				; CHECK-NEXT: %[[SEXT1:.*]] = sext i32 %[[OR1]] to i64
				; CHECK-NEXT: %[[GEP3:.*]] = getelementptr inbounds i64, i64* %[[GEP2]], i64 %[[SEXT1]]
				; CHECK-NEXT: %[[CAST1:.*]] = bitcast i64* %[[GEP3]] to <vscale x 8 x i64>*
				; CHECK-NEXT: store <vscale x 8 x i64> %[[REVERSE6]], <vscale x 8 x i64>* %[[CAST1]], align 8

				entry:
				%cmp8 = icmp sgt i64 %N, 0
				br i1 %cmp8, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%i.09.in = phi i64 [ %i.09, %for.body ], [ %N, %entry ]
				%i.09 = add nsw i64 %i.09.in, -1
				%arrayidx = getelementptr inbounds i64, i64* %b, i64 %i.09
				%0 = load i64, i64* %arrayidx, align 8
				%add = add i64 %0, 1
				%arrayidx2 = getelementptr inbounds i64, i64* %a, i64 %i.09
				store i64 %add, i64* %arrayidx2, align 8
				%cmp = icmp sgt i64 %i.09.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}

				attributes #0 = { "target-cpu"="generic" "target-features"="+neon,+sve" }

				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 8}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}
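
The `mul i32 %vscale, -8` followed by `or i32 ..., 1` in the CHECK lines above is consistent with instcombine rewriting `1 - RunTimeVF`: a multiple of 8 has its low three bits clear, so OR-ing in 1 gives the same result as adding 1. The sketch below checks that identity with wrapping 32-bit arithmetic. The function names are hypothetical, and the attribution of the rewrite to instcombine is an assumption rather than something the review states.

```cpp
#include <cassert>
#include <cstdint>

// Models `mul i32 %vscale, -8` followed by `or i32 %mul, 1` using wrapping
// unsigned arithmetic, which has the same bit pattern as two's-complement i32.
uint32_t lastLaneViaOr(uint32_t vscale) {
  const uint32_t mul = vscale * static_cast<uint32_t>(-8);
  return mul | 1u;
}

// Models `1 - vscale * 8`, i.e. 1 - RunTimeVF for <vscale x 8 x ...>.
uint32_t lastLaneViaSub(uint32_t vscale) {
  return 1u - vscale * 8u;
}

// Returns true if both forms agree for every vscale in [0, maxVscale].
// maxVscale must be smaller than UINT32_MAX so that the loop terminates.
bool formsAgree(uint32_t maxVscale) {
  assert(maxVscale < UINT32_MAX);
  for (uint32_t v = 0; v <= maxVscale; ++v)
    if (lastLaneViaOr(v) != lastLaneViaSub(v)) return false;
  return true;
}
```

For `vscale = 2` both forms yield the bit pattern of -15, the offset from the highest-addressed lane back to the lowest-addressed lane of a 16-element vector.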

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll

This file was added.

				; This is the loop in C++ being vectorized in this file with
				; shuffle reverse

				;#pragma clang loop vectorize_width(4, fixed)
				; for (long int i = N - 1; i >= 0; i--)
				; {
				; if (cond[i])
				; a[i] += 1;
				; }

				; The test checks if the mask is being correctly created, reversed and used

				; RUN: opt -loop-vectorize -dce -instcombine -mtriple aarch64-linux-gnu -S < %s 2>%t | FileCheck %s

				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning


				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown-linux-gnu"

				define void @vector_reverse_mask_v4i1(double* %a, double* %cond, i64 %N) #0 {
				; CHECK-LABEL: vector.body:
				david-arm [Done]: nit: Perhaps better renamed to `v4i1` since the CHECK lines have that type?
				; CHECK: %[[REVERSE6:.*]] = shufflevector <4 x i1> %{{.*}}, <4 x i1> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				; CHECK: %[[WIDEMSKLOAD:.*]] = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* nonnull %{{.*}}, i32 8, <4 x i1> %[[REVERSE6]], <4 x double> poison)
				; CHECK-NEXT: %[[FADD:.*]] = fadd <4 x double> %[[WIDEMSKLOAD]]
				david-arm [Done]: This seems a bit less well tested than the sve version - perhaps it's worth adding CHECK lines for the store too? I assume the store will reuse the same mask?
				CarolineConcatto (Author) [Done]: OK, I think it is valid to have similar checks for scalable and fixed vectors.
				; CHECK: call void @llvm.masked.store.v4f64.p0v4f64(<4 x double> %[[FADD]], <4 x double>* %{{.*}}, i32 8, <4 x i1> %[[REVERSE6]])

				entry:
				%cmp7 = icmp sgt i64 %N, 0
				br i1 %cmp7, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup, %entry
				ret void

				for.body: ; preds = %for.body, %entry
				%i.08.in = phi i64 [ %i.08, %for.inc ], [ %N, %entry ]
				%i.08 = add nsw i64 %i.08.in, -1
				%arrayidx = getelementptr inbounds double, double* %cond, i64 %i.08
				%0 = load double, double* %arrayidx, align 8
				%tobool = fcmp une double %0, 0.000000e+00
				br i1 %tobool, label %if.then, label %for.inc

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds double, double* %a, i64 %i.08
				%1 = load double, double* %arrayidx1, align 8
				%add = fadd double %1, 1.000000e+00
				store double %add, double* %arrayidx1, align 8
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%cmp = icmp sgt i64 %i.08.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}

				attributes #0 = {"target-cpu"="generic" "target-features"="+neon,+sve"}


				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 4}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll

This file was added.

				; Test VLA for reverse with fixed size vector
				; This is the loop in C++ being vectorized in this file with
				; shuffle reverse
				; #pragma clang loop vectorize_width(8, fixed)
				; for (int i = N-1; i >= 0; --i)
				; a[i] = b[i] + 1.0;

				; RUN: opt -loop-vectorize -dce -mtriple aarch64-linux-gnu -S < %s 2>%t | FileCheck %s

				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning

				define void @vector_reverse_f64(i64 %N, double* %a, double* %b) #0 {
				; CHECK-LABEL: vector_reverse_f64
				; CHECK-LABEL: vector.body
				; CHECK: %[[GEP:.*]] = getelementptr inbounds double, double* %{{.*}}, i32 0
				; CHECK-NEXT: %[[GEP1:.*]] = getelementptr inbounds double, double* %[[GEP]], i32 -7
				; CHECK-NEXT: %[[CAST:.*]] = bitcast double* %[[GEP1]] to <8 x double>*
				; CHECK-NEXT: %[[WIDE:.*]] = load <8 x double>, <8 x double>* %[[CAST]], align 8
				; CHECK-NEXT: %[[REVERSE:.*]] = shufflevector <8 x double> %[[WIDE]], <8 x double> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: %[[FADD:.*]] = fadd <8 x double> %[[REVERSE]]
				david-arm [Done]: I think that since you're adding tests for this it's probably a good idea to expand all the shufflevector CHECK lines in this file to make sure we have a reverse mask here too?
				; CHECK-NEXT: %[[GEP2:.*]] = getelementptr inbounds double, double* {{.*}}, i64 {{.*}}
				; CHECK-NEXT: %[[REVERSE6:.*]] = shufflevector <8 x double> %[[FADD]], <8 x double> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: %[[GEP3:.*]] = getelementptr inbounds double, double* %[[GEP2]], i32 0
				; CHECK-NEXT: %[[GEP4:.*]] = getelementptr inbounds double, double* %[[GEP3]], i32 -7
				; CHECK-NEXT: %[[CAST:.*]] = bitcast double* %[[GEP4]] to <8 x double>*
				; CHECK-NEXT: store <8 x double> %[[REVERSE6]], <8 x double>* %[[CAST]], align 8

				david-arm [Done]: nit: %[[REVERSE6]]?
				CarolineConcatto (Author) [Done]: Good catch!
				entry:
				%cmp7 = icmp sgt i64 %N, 0
				br i1 %cmp7, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%i.08.in = phi i64 [ %i.08, %for.body ], [ %N, %entry ]
				%i.08 = add nsw i64 %i.08.in, -1
				%arrayidx = getelementptr inbounds double, double* %b, i64 %i.08
				%0 = load double, double* %arrayidx, align 8
				%add = fadd double %0, 1.000000e+00
				%arrayidx1 = getelementptr inbounds double, double* %a, i64 %i.08
				store double %add, double* %arrayidx1, align 8
				%cmp = icmp sgt i64 %i.08.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}

				define void @vector_reverse_i64(i64 %N, i64* %a, i64* %b) #0 {
				; CHECK-LABEL: vector_reverse_i64
				; CHECK-LABEL: vector.body
				; CHECK: %[[GEP:.*]] = getelementptr inbounds i64, i64* %{{.*}}, i32 0
				; CHECK-NEXT: %[[GEP1:.*]] = getelementptr inbounds i64, i64* %[[GEP]], i32 -7
				; CHECK-NEXT: %[[CAST:.*]] = bitcast i64* %[[GEP1]] to <8 x i64>*
				; CHECK-NEXT: %[[WIDE:.*]] = load <8 x i64>, <8 x i64>* %[[CAST]], align 8
				; CHECK-NEXT: %[[REVERSE:.*]] = shufflevector <8 x i64> %[[WIDE]], <8 x i64> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: %[[FADD:.*]] = add <8 x i64> %[[REVERSE]]
				; CHECK-NEXT: %[[GEP2:.*]] = getelementptr inbounds i64, i64* {{.*}}, i64 {{.*}}
				; CHECK-NEXT: %[[REVERSE6:.*]] = shufflevector <8 x i64> %[[FADD]], <8 x i64> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: %[[GEP3:.*]] = getelementptr inbounds i64, i64* %[[GEP2]], i32 0
				; CHECK-NEXT: %[[GEP4:.*]] = getelementptr inbounds i64, i64* %[[GEP3]], i32 -7
				; CHECK-NEXT: %[[CAST1:.*]] = bitcast i64* %[[GEP4]] to <8 x i64>*
				; CHECK-NEXT: store <8 x i64> %[[REVERSE6]], <8 x i64>* %[[CAST1]], align 8

				david-arm [Done]: nit: %[[REVERSE6]]
				entry:
				%cmp8 = icmp sgt i64 %N, 0
				br i1 %cmp8, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup, %entry
				ret void

				for.body: ; preds = %entry, %for.body
				%i.09.in = phi i64 [ %i.09, %for.body ], [ %N, %entry ]
				%i.09 = add nsw i64 %i.09.in, -1
				%arrayidx = getelementptr inbounds i64, i64* %b, i64 %i.09
				%0 = load i64, i64* %arrayidx, align 8
				%add = add i64 %0, 1
				%arrayidx2 = getelementptr inbounds i64, i64* %a, i64 %i.09
				store i64 %add, i64* %arrayidx2, align 8
				%cmp = icmp sgt i64 %i.09.in, 1
				br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
				}

				attributes #0 = { "target-cpu"="generic" "target-features"="+neon,+sve" }

				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 8}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}
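
For the fixed-width case, the pattern checked above (GEP by 0, GEP by -7, wide load of 8 elements, shufflevector with mask <7, 6, ..., 0>) can be emulated as follows. This is an illustrative sketch only: `reverseLanes` and `reverseLoad` are hypothetical helpers, and `VF` is fixed at 8 to match `vectorize_width(8, fixed)`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 8;

// Emulates shufflevector with the mask <7, 6, 5, 4, 3, 2, 1, 0>.
std::array<double, VF> reverseLanes(const std::array<double, VF>& v) {
  std::array<double, VF> r{};
  for (std::size_t k = 0; k < VF; ++k) r[k] = v[VF - 1 - k];
  return r;
}

// Emulates the reversed wide load for scalar index i (unroll part 0):
// step back VF - 1 elements, load VF contiguous elements, reverse the lanes.
std::array<double, VF> reverseLoad(const double* base, std::size_t i) {
  assert(i + 1 >= VF);                    // the block must not start before base
  const double* p = base + i - (VF - 1);  // corresponds to the `i32 -7` GEP
  std::array<double, VF> wide{};
  for (std::size_t k = 0; k < VF; ++k) wide[k] = p[k];
  return reverseLanes(wide);
}
```

After the shuffle, lane 0 holds the element at index `i` and lane 7 holds the element at index `i - 7`, so the lanes are in the order in which the scalar loop visits them.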

Revision Contents

Diff 330905

llvm/include/llvm/IR/IRBuilder.h

llvm/lib/IR/IRBuilder.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse-mask4.ll

llvm/test/Transforms/LoopVectorize/AArch64/sve-vector-reverse.ll

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll
