The LoopVectorizer pass can emit calls to get_active_lane_mask whose operands
are induction variables that in reality only increment for a single loop
iteration before the get_active_lane_mask call, meaning the intrinsic always
returns an all-false predicate. It is possible to detect this in InstCombine
and eliminate the always-false intrinsic, which allows optimizations further
down the pipeline to trigger for scalable vectors.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for this @MattDevereau! Looks like a nice improvement. I just have a few comments ...
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3219 | I think the following optimisation should also work for fixed-width vectors, right? It just means that in your MinVScaleElts calculations below you wouldn't need to query the vscale minimum. | |
3231 | I wonder if this is the canonical form or not? For example, I don't know if the operands would ever be swapped so the PHI node is on the right? Perhaps in order to make this more robust you could do something like: `Value *AddOp1, *AddOp2; if (!match(Op0, m_Add(m_Value(AddOp1), m_Value(AddOp2)))) break; Value *NonPhi = AddOp2; PHINode *Phi = dyn_cast<PHINode>(AddOp1); if (!Phi && (Phi = dyn_cast<PHINode>(AddOp2))) NonPhi = AddOp1; if (!Phi) break;` Not saying this is the best code obviously, but hopefully it helps to explain what I mean. | |
3235 | I think this condition needs to be a bit stronger, i.e. check the PHI has only two operands and that one incoming value comes from this block, then look for the other incoming value. The problem is that this might not be a loop, but instead just be a block with two predecessors. |
Do I understand correctly that this is basically optimizing get.active.lane.mask for the case where we can compute the range of op0 and show that it is always >= op1? And it implements that range calculation for this special case of a post-inc IV?
I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3254 | takeName on a constant doesn't make sense. |
Yes, that's correct. If it's impossible for op0 of get.active.lane.mask to be lower than op1, the intrinsic will always return an all-false mask. This optimization fixes an edge case emitted by the loop vectorizer, which is unable to perform this optimization itself.
I don't think this is quite right in that it does not account for addition overflow. That's not actually possible in your specific test cases, but I don't think your implementation has sufficient preconditions to prove this.
Would this be as simple as requiring nsw/nuw flags on the add?
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding, but the trip count is unknown at that point. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant.
Yes, that should be enough.
FYI there is a matchSimpleRecurrence() helper to check for the general "phi + binop recurrence" pattern.
Why does the loop vectorizer generate this code in the first place? Given that it involves reasoning about IVs, it might be more straightforward to handle this in SCEV/LV.
It generates this with low trip counts when LTO is enabled. It tries to vectorize modules with tail folding, but the trip count is unknown at that point. After linking it can do a second round of optimizations where the trip count is known and we can replace the get.active.lane.mask call with a constant.
Does D148010 fix your issue? This is exactly why vectorizations shouldn't be run pre-link.
@nikic I've updated this patch just for the sake of it, but I expect D148010 should fix the problem and make this patch unnecessary. Unfortunately I only have a small snippet of IR generated by clang and do not have my hands on the source code that generated it. Until I have the source, I can't conclude whether D148010 has fixed the original problem.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3219 | I was under the impression this issue only appeared because the vscale/vectorisation-factor values stop LLVM from recognising everything as constants. There is get_active_lane_mask logic in ConstantFolding.cpp; however, until I have the source code of this problem to test with fixed-width vectors, or can run it with D148010 applied, it's hard to say whether fixed-width vectors are really an issue. | |
3231 | I've replaced this with some logic based on a call to matchSimpleRecurrence, which should prove more tightly that this is actually a loop. | |
3235 | I've replaced this with some logic based on a call to matchSimpleRecurrence, which should prove more tightly that this is actually a loop.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3241 | I think in reality the add of an induction variable generated by the loop vectoriser won't have the nsw and nuw flags on it so this optimisation won't trigger in practice. I don't think we need to require these flags though for your optimisation to work. You should just be able to check the start value for the PHI is 0, because you know it cannot wrap on the first iteration. This is typically the canonical form for vectoriser output as well I think. | |
3246 | We should also bail out if it's not zero here, I think. | |
3258 | By bailing out earlier for non-zero values of PhiOp0 you can simplify this to: `if (MinVScaleElts < Op1->getZExtValue()) break;` | |
llvm/test/Transforms/InstCombine/get-active-lane-mask.ll | ||
43 | It would be good to test this without the nuw and nsw flags, since that's what the vectoriser generates. |
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
3237 | I think 'L' and 'R' are a bit misleading here because they are actually 'Start' and 'Step', whereas on reading the current code it sounds like 'Left' and 'Right'. Can you rename them please? | |
3242 | Again, perhaps rename to StartC? | |
3246 | In theory, Vf and 'Step' (or 'R' in the current code) should be identical. If not, then we can't apply the optimisation. I'm thinking of a case like this: `loop: %idx = phi i64 [ 0, %preheader ], [ %idx.inc, %loop ] ... %idx.new = add i64 %idx, %vf %idx.inc = add i64 %idx, %other_thing %pred = call <vscale x 16 x i1> @llvm.active.get.lane.mask.nxv16i1.i64(%idx.new, %limit)` So I think you need to check that Vf == Step somehow. | |
llvm/test/Transforms/InstCombine/get-active-lane-mask.ll | ||
5 | I think you either have to make this a generic test without the "target triple" or move this test into a AArch64 directory. |