This is an archive of the discontinued LLVM Phabricator instance.

Differential D109445

[SVE][LoopVectorize] Optimise code generated by widenPHIInstruction
ClosedPublic

Authored by RosieSumpter on Sep 8 2021, 9:04 AM.

Details

Reviewers

efriedma
david-arm
dmgreen
sdesmalen

Commits

rG9d1bea9c88b3: [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction

Summary

For SVE, when scalarising the PHI instruction the whole vector part is
generated, as opposed to creating instructions for each lane as is done
for fixed-width vectors. However, in some cases the lane values may be
needed later (e.g. for a load instruction), so we still need to calculate
these values to avoid extractelement being called on the vector part.
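
To illustrate the summary (this is not code from the patch), here is a minimal C++ sketch under a simplified model: each lane of the vector part for unrolled part `Part` is assumed to hold `PtrInd + Part * VF + lane`, with `VF` standing in for the runtime lane count. The function names `widenedPartIndices`, `scalarLaneIndex` and `scalarMatchesExtract` are hypothetical. It shows that a lane value can be computed with scalar adds alone and agrees with the value that would otherwise be extracted from the vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the widened index vector of one unrolled part:
// lane i holds ptrInd + part * vf + i (vf models the runtime lane count).
std::vector<int64_t> widenedPartIndices(int64_t ptrInd, unsigned part,
                                        unsigned vf) {
  std::vector<int64_t> indices(vf);
  for (unsigned lane = 0; lane < vf; ++lane)
    indices[lane] = ptrInd + static_cast<int64_t>(part) * vf + lane;
  return indices;
}

// Scalar index for a single lane, computed with plain adds and without
// touching the vector (the analogue of the extra add/gep instructions).
int64_t scalarLaneIndex(int64_t ptrInd, unsigned part, unsigned vf,
                        unsigned lane) {
  assert(lane < vf && "lane must be within the vector");
  int64_t partStart = static_cast<int64_t>(part) * vf;
  return ptrInd + partStart + lane;
}

// Returns true when the scalar path agrees with reading every lane back
// out of the vector (the analogue of extractelement).
bool scalarMatchesExtract(int64_t ptrInd, unsigned part, unsigned vf) {
  std::vector<int64_t> vec = widenedPartIndices(ptrInd, part, vf);
  for (unsigned lane = 0; lane < vf; ++lane)
    if (vec[lane] != scalarLaneIndex(ptrInd, part, vf, lane))
      return false;
  return true;
}
```

In this model, lane 0 of a part is just the part's start value, which is why no extract is needed for it.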

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RosieSumpter created this revision. Sep 8 2021, 9:04 AM

Herald added a reviewer: efriedma. Sep 8 2021, 9:04 AM

Herald added subscribers: ctetreau, psnobl, hiraditya, tschuett.

RosieSumpter requested review of this revision. Sep 8 2021, 9:04 AM

Herald added a project: Restricted Project. Sep 8 2021, 9:04 AM

Herald added a subscriber: llvm-commits.

RosieSumpter added reviewers: david-arm, dmgreen. Sep 8 2021, 9:10 AM

Harbormaster completed remote builds in B123067: Diff 371359. Sep 8 2021, 9:42 AM

RosieSumpter added a reviewer: sdesmalen. Sep 9 2021, 2:05 AM

SjoerdMeijer added a subscriber: SjoerdMeijer. Sep 9 2021, 3:56 AM

Just a minor nit on the commit message: this patch is not really specific to AArch64 SVE but rather to scalable vectors.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	Hi @RosieSumpter, I think it's worth elaborating a little bit more on the 'generate better code' in the comment. [(too) long explanation here] From what I understand, the code is better because the `extractelement` instruction that is otherwise generated (for scalar uses of this vector) may not always be folded away if the stepvector has multiple uses, leading to a redundant move (in case of element 0 for the vector-element-0 -> gpr move) or possibly expensive extractelement instructions (to extract a fixed-width lane from a scalable vector) for element > 0. In the former case, the value for element 0 is freely available because it is the start value of the stepvector. In the latter case, there will be a cost regardless. Either the additional `add/gep` generated below to offset from the start value of the stepvector, or the extract from the stepvector itself. It's just expected that the scalar code will be cheaper. Can you maybe capture some of that in the comment? (albeit more succinctly)

SjoerdMeijer added inline comments. Sep 9 2021, 8:16 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	I had exactly the same questions as Sander. The main question, I think, is indeed why this is better, which is not (that) obvious from the test changes. Thus, I was wondering, does this deserve adding some CodeGen tests?

sdesmalen added inline comments. Sep 9 2021, 8:25 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	IMO an improved description should be sufficient.

SjoerdMeijer added inline comments. Sep 9 2021, 9:00 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this? I think it would help too with explaining why this is better.

Improved comment

RosieSumpter added inline comments. Sep 10 2021, 1:28 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	For now I have just updated the comment (hopefully it makes sense - I have tried to add some detail but keep it concise, but am happy to change it!) Also happy to add a codegen test if it's decided that it's necessary.

Thanks @RosieSumpter that explains it well I think.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	> Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this?

	Correct, it normally doesn't happen for individual IR passes that the resulting asm for a particular target is tested for a change to the transformation, right? IMO that should be avoided whenever possible, because it defeats the purpose of a unit test. In this case it should be sufficient to test explicitly that the `extractelement` does not exist after loop-vectorize. That happens with the extra `CHECK-NOT` line in sve-widen-phi.ll. The fmov is the code-generated equivalent to the extractelement of element 0, so would there be any value in also code-generating this for AArch64 to show that the fmov is removed? Perhaps @RosieSumpter can add a similar comment to the `CHECK-NOT` line to clarify this a bit more in the test itself?
4771	nit: s/is not folded/is not folded away/
4771–4772	nit: s/we still need to calculate/we still calculate/
	nit: s/to avoid a redundant move/to avoid redundant moves/
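
For readers unfamiliar with the `CHECK-NOT` directive mentioned above: FileCheck rejects the input if the pattern occurs between the previous match and the next one. A toy C++ model of that behaviour follows. It is a sketch only; the name `absentBetween` is hypothetical, and it uses plain substring search, whereas FileCheck itself also supports regular expressions and variable captures.

```cpp
#include <cstddef>
#include <string>

// Toy model of a CHECK / CHECK-NOT / CHECK sequence: returns true when
// `forbidden` does not occur between the first match of `before` and the
// following match of `after`.
bool absentBetween(const std::string &text, const std::string &before,
                   const std::string &forbidden, const std::string &after) {
  std::size_t start = text.find(before);
  if (start == std::string::npos)
    return false; // the leading CHECK would already fail
  start += before.size();
  std::size_t end = text.find(after, start);
  if (end == std::string::npos)
    return false; // the trailing CHECK would fail
  // Only the region between the two matches is searched.
  return text.substr(start, end - start).find(forbidden) == std::string::npos;
}
```

Applied to this review, the region of interest lies between the `getelementptr` for `%b` and the following `bitcast`, and the forbidden text is `extractelement`.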

SjoerdMeijer added inline comments. Sep 10 2021, 2:16 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	I was proposing an additional test, thus I was not proposing that a codegen test should replace an IR test, which would indeed be a bad idea. This change is/was talking about codegen improvements. The IR changes here don't make that very obvious IMO. So if I was doing this work, I would add additional codegen tests with the before/after IR as input, and check its codegen to make sure these improvements are there. Not sure if I am missing something or should be discussing adding tests here. Don't have strong opinions either, so will leave finishing the review up to you.

Harbormaster completed remote builds in B123384: Diff 371819. Sep 10 2021, 2:28 AM

Amended original comment, and added comment to sve-widen-phi.ll test.

sdesmalen accepted this revision. Sep 10 2021, 3:52 AM

sdesmalen added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771	Thanks, I understood what you meant about adding an additional test. In this case it's more about InstCombine not folding it away because the vector has multiple uses than it is about the target not being able to do something special with it, so from my perspective adding a codegen test for this would be a bit out of place.

This revision is now accepted and ready to land. Sep 10 2021, 3:52 AM

This revision was landed with ongoing or failed builds. Sep 10 2021, 4:06 AM

Closed by commit rG9d1bea9c88b3: [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction (authored by RosieSumpter).

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rG9d1bea9c88b3: [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction.

Harbormaster completed remote builds in B123409: Diff 371859. Sep 10 2021, 4:09 AM

kmclaughlin added a subscriber: kmclaughlin. Sep 23 2021, 5:45 AM

This comment was removed by kmclaughlin.

Hi all, apologies for my last comment on this patch - a mistake with email auto-complete led to an internal email being forwarded to Phabricator.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	11 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-gep.ll	35 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll	5 lines

Diff 371867

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ if (Cost->isScalarAfterVectorization(P, State.VF)) {
       PtrIndSplat = Builder.CreateVectorSplat(VF, PtrInd);
     }

     for (unsigned Part = 0; Part < UF; ++Part) {
       Value *PartStart = createStepForVF(
           Builder, ConstantInt::get(PtrInd->getType(), Part), VF);

       if (NeedsVectorIndex) {
+        // Here we cache the whole vector, which means we can support the
+        // extraction of any lane. However, in some cases the extractelement
+        // instruction that is generated for scalar uses of this vector (e.g.
+        // a load instruction) is not folded away. Therefore we still
+        // calculate values for the first n lanes to avoid redundant moves
+        // (when extracting the 0th element) and to produce scalar code (i.e.
+        // additional add/gep instructions instead of expensive extractelement
+        // instructions) when extracting higher-order elements.
         Value *PartStartSplat = Builder.CreateVectorSplat(VF, PartStart);
         Value *Indices = Builder.CreateAdd(PartStartSplat, UnitStepVec);
         Value *GlobalIndices = Builder.CreateAdd(PtrIndSplat, Indices);
         Value *SclrGep =
             emitTransformedIndex(Builder, GlobalIndices, PSE.getSE(), DL, II);
         SclrGep->setName("next.gep");
         State.set(PhiR, SclrGep, Part);
-        // We've cached the whole vector, which means we can support the
-        // extraction of any lane.
-        continue;
       }

       for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
         Value *Idx = Builder.CreateAdd(
             PartStart, ConstantInt::get(PtrInd->getType(), Lane));
         Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx);
         Value *SclrGep =
             emitTransformedIndex(Builder, GlobalIdx, PSE.getSE(), DL, II);

llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-gep.ll

 ; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8*, i8** [[START_1]], i64 [[TMP5]]
 ; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
 ; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[INDEX]], i32 0
 ; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
 ; CHECK-NEXT: [[TMP7:%.*]] = add <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 0, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), [[TMP6]]
 ; CHECK-NEXT: [[TMP8:%.*]] = add <vscale x 2 x i64> [[DOTSPLAT]], [[TMP7]]
 ; CHECK-NEXT: [[NEXT_GEP4:%.*]] = getelementptr i8, i8* [[START_2]], <vscale x 2 x i64> [[TMP8]]
-; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, <vscale x 2 x i8*> [[NEXT_GEP4]], i64 1
-; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i8*, i8** [[NEXT_GEP]], i32 0
-; CHECK-NEXT: [[TMP11:%.*]] = bitcast i8** [[TMP10]] to <vscale x 2 x i8*>*
-; CHECK-NEXT: store <vscale x 2 x i8*> [[TMP9]], <vscale x 2 x i8*>* [[TMP11]], align 8
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <vscale x 2 x i8*> [[NEXT_GEP4]], i32 0
-; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i8, i8* [[TMP12]], i32 0
-; CHECK-NEXT: [[TMP14:%.*]] = bitcast i8* [[TMP13]] to <vscale x 2 x i8>*
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i8>, <vscale x 2 x i8>* [[TMP14]], align 1
-; CHECK-NEXT: [[TMP15:%.*]] = add <vscale x 2 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 2 x i8> insertelement (<vscale x 2 x i8> poison, i8 1, i32 0), <vscale x 2 x i8> poison, <vscale x 2 x i32> zeroinitializer)
-; CHECK-NEXT: [[TMP16:%.*]] = bitcast i8* [[TMP13]] to <vscale x 2 x i8>*
-; CHECK-NEXT: store <vscale x 2 x i8> [[TMP15]], <vscale x 2 x i8>* [[TMP16]], align 1
-; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP18:%.*]] = mul i64 [[TMP17]], 2
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
-; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[NEXT_GEP5:%.*]] = getelementptr i8, i8* [[START_2]], i64 [[TMP9]]
+; CHECK-NEXT: [[TMP10:%.*]] = add i64 [[INDEX]], 1
+; CHECK-NEXT: [[NEXT_GEP6:%.*]] = getelementptr i8, i8* [[START_2]], i64 [[TMP10]]
+; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, <vscale x 2 x i8*> [[NEXT_GEP4]], i64 1
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i8*, i8** [[NEXT_GEP]], i32 0
+; CHECK-NEXT: [[TMP13:%.*]] = bitcast i8** [[TMP12]] to <vscale x 2 x i8*>*
+; CHECK-NEXT: store <vscale x 2 x i8*> [[TMP11]], <vscale x 2 x i8*>* [[TMP13]], align 8
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i8, i8* [[NEXT_GEP5]], i32 0
+; CHECK-NEXT: [[TMP15:%.*]] = bitcast i8* [[TMP14]] to <vscale x 2 x i8>*
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i8>, <vscale x 2 x i8>* [[TMP15]], align 1
+; CHECK-NEXT: [[TMP16:%.*]] = add <vscale x 2 x i8> [[WIDE_LOAD]], shufflevector (<vscale x 2 x i8> insertelement (<vscale x 2 x i8> poison, i8 1, i32 0), <vscale x 2 x i8> poison, <vscale x 2 x i32> zeroinitializer)
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast i8* [[TMP14]] to <vscale x 2 x i8>*
+; CHECK-NEXT: store <vscale x 2 x i8> [[TMP16]], <vscale x 2 x i8>* [[TMP17]], align 1
+; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 2
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
+; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK: middle.block:
 ; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
 ; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK: scalar.ph:
 ; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i8** [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[START_1]], [[ENTRY]] ]
 ; CHECK-NEXT: [[BC_RESUME_VAL2:%.*]] = phi i8* [ [[IND_END3]], [[MIDDLE_BLOCK]] ], [ [[START_2]], [[ENTRY]] ]
 ; CHECK-NEXT: br label [[LOOP_BODY:%.*]]

llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll

 ; CHECK-LABEL: @pointer_iv_mixed(
 ; CHECK: vector.body
 ; CHECK: %[[IDX:.*]] = phi i64 [ 0, %vector.ph ], [ %{{.*}}, %vector.body ]
 ; CHECK: %[[STEPVEC:.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
 ; CHECK-NEXT: %[[TMP1:.*]] = insertelement <vscale x 2 x i64> poison, i64 %[[IDX]], i32 0
 ; CHECK-NEXT: %[[TMP2:.*]] = shufflevector <vscale x 2 x i64> %[[TMP1]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
 ; CHECK-NEXT: %[[VECIND1:.*]] = add <vscale x 2 x i64> %[[TMP2]], %[[STEPVEC]]
 ; CHECK-NEXT: %[[APTRS1:.*]] = getelementptr i32, i32* %a, <vscale x 2 x i64> %[[VECIND1]]
+; CHECK-NEXT: %[[GEPA1:.*]] = getelementptr i32, i32* %a, i64 %[[IDX]]
 ; CHECK-NEXT: %[[VSCALE64:.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT: %[[VSCALE64X2:.*]] = shl i64 %[[VSCALE64]], 1
 ; CHECK-NEXT: %[[TMP3:.*]] = insertelement <vscale x 2 x i64> poison, i64 %[[VSCALE64X2]], i32 0
 ; CHECK-NEXT: %[[TMP4:.*]] = shufflevector <vscale x 2 x i64> %[[TMP3]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
 ; CHECK-NEXT: %[[TMP5:.*]] = add <vscale x 2 x i64> %[[TMP4]], %[[STEPVEC]]
 ; CHECK-NEXT: %[[VECIND2:.*]] = add <vscale x 2 x i64> %[[TMP2]], %[[TMP5]]
 ; CHECK-NEXT: %[[APTRS2:.*]] = getelementptr i32, i32* %a, <vscale x 2 x i64> %[[VECIND2]]
 ; CHECK-NEXT: %[[GEPB1:.*]] = getelementptr i32*, i32** %b, i64 %[[IDX]]
+; The following checks that there is no extractelement after
+; vectorization when the stepvector has multiple uses, which demonstrates
+; the removal of a redundant fmov instruction in the generated asm code.
+; CHECK-NOT: %[[EXTRACT:.*]] = extractelement <vscale x 2 x i32*> [[APTRS1]], i32 0
 ; CHECK: %[[BPTR1:.*]] = bitcast i32** %[[GEPB1]] to <vscale x 2 x i32*>*
 ; CHECK-NEXT: store <vscale x 2 x i32*> %[[APTRS1]], <vscale x 2 x i32*>* %[[BPTR1]], align 8
 ; CHECK: %[[VSCALE32:.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT: %[[VSCALE32X2:.*]] = shl i32 %[[VSCALE32]], 1
 ; CHECK-NEXT: %[[TMP6:.*]] = sext i32 %[[VSCALE32X2]] to i64
 ; CHECK-NEXT: %[[GEPB2:.*]] = getelementptr i32*, i32** %[[GEPB1]], i64 %[[TMP6]]
 ; CHECK-NEXT: %[[BPTR2:.*]] = bitcast i32** %[[GEPB2]] to <vscale x 2 x i32*>*
 ; CHECK-NEXT store <vscale x 2 x i32*> %[[APTRS2]], <vscale x 2 x i32*>* %[[BPTR2]], align 8