This is an archive of the discontinued LLVM Phabricator instance.

Differential D158291

[PoC][WIP] Add an AArch64 specific pass for loop idiom recognition
Needs Review · Public

Authored by david-arm on Aug 18 2023, 9:31 AM.

Details

Reviewers

kmclaughlin

Summary

This pass looks for loops such as the following:

while (i != max_len)
    if (a[i] != b[i])
        break;

Although similar to a memcmp, this is slightly different because instead of returning
the difference between the values of the first non-matching pair of bytes, it returns
the index of the first mismatch. As such, we are not able to lower this to a memcmp call.
Replacing this pattern with a specialised predicated SVE loop gives a significant
performance improvement for AArch64.
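For illustration only, the idiom can be written out as a complete function. This is a minimal sketch: the name mismatchIndex and its signature are hypothetical and do not come from the patch, and the increment that the summary's snippet leaves implicit is spelled out here.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the patch): scan a and b from index
// `start` and return the index of the first mismatching byte, or max_len
// if every byte in [start, max_len) matches.
uint32_t mismatchIndex(const uint8_t *a, const uint8_t *b, uint32_t start,
                       uint32_t max_len) {
  uint32_t i = start;
  while (i != max_len) {
    if (a[i] != b[i])
      break;
    ++i; // plain unsigned 32-bit increment, as in the loops the summary targets
  }
  return i;
}
```

A memcmp call over the same range would only report whether (and in which direction) the buffers differ, not where, which is why the loop cannot simply be lowered to memcmp.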

This patch introduces a new pass which identifies this pattern and replaces it with the
SVE loop. It is intended as a short-term solution until this is handled in the vectoriser.

A new intrinsic is created in this patch for counting the trailing zero elements in a
vector which has generic lowering in SelectionDAGBuilder. For AArch64 where SVE is
enabled, this is replaced with brkb & cntp instructions.
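As a rough scalar model of what the intrinsic computes, the sketch below pairs a straightforward reference loop with an emulation of the generic expansion in this patch's SelectionDAGBuilder change (step vector, VL - step, AND with the sign-extended mask, unsigned max reduction, VL - max). The function names are hypothetical, the model assumes element 0 plays the role of the least significant bit, and it uses 64-bit lanes throughout rather than the narrower element type the real expansion selects.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reference semantics (assumed): count zero elements starting at element 0
// and stop at the first non-zero element. An all-zero input yields VL.
uint64_t cttzEltsReference(const std::vector<bool> &Mask) {
  uint64_t N = 0;
  while (N < Mask.size() && !Mask[N])
    ++N;
  return N;
}

// Scalar emulation of the generic expansion, one loop iteration per lane.
uint64_t cttzEltsExpanded(const std::vector<bool> &Mask) {
  const uint64_t VL = Mask.size();
  uint64_t Max = 0;
  for (uint64_t I = 0; I < VL; ++I) {
    uint64_t StepVL = VL - I;                  // SplatVL - StepVec
    uint64_t Ext = Mask[I] ? ~uint64_t(0) : 0; // sign-extended i1 lane
    Max = std::max(Max, StepVL & Ext);         // AND, then VECREDUCE_UMAX
  }
  return VL - Max; // first active lane k gives Max == VL - k
}
```

The first active lane always carries the largest VL - I value among active lanes, so the unsigned max reduction recovers its index; with no active lanes the reduction yields 0 and the result is VL.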

Patch co-authored by Kerry McLaughlin (@kmclaughlin) and David Sherwood (@david-arm)

Note: This is a work in progress, see discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383

Event Timeline

kmclaughlin created this revision. · Aug 18 2023, 9:31 AM

Herald added a project: Restricted Project. · Aug 18 2023, 9:31 AM

Herald added subscribers: ctetreau, arphaman, hiraditya, kristof.beyls.

kmclaughlin requested review of this revision. · Aug 18 2023, 9:31 AM

Herald added a project: Restricted Project. · Aug 18 2023, 9:31 AM

Herald added subscribers: llvm-commits, jdoerfert.

Matt added a subscriber: Matt. · Aug 18 2023, 9:35 AM

kmclaughlin edited the summary of this revision. · Aug 18 2023, 9:36 AM

kmclaughlin added a subscriber: david-arm.

craig.topper added a subscriber: craig.topper. · Aug 18 2023, 9:46 AM

craig.topper added inline comments.

llvm/include/llvm/IR/Intrinsics.td
2184	I wonder if something like "find first nonzero element" would be better?
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7506	changeVectorElementType doesn't work if the source type is an MVT and the resulting type is not an MVT. Probably better to use getVectorVT. There have been two recent bug fixes for something like this https://reviews.llvm.org/D157392 and 512a6c50e87c1956c028daf3317b07b3aa0e309f

efriedma added a subscriber: efriedma. · Aug 18 2023, 11:08 AM

efriedma added inline comments.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7520	Is the upper bound here guaranteed to fit into an i64?
7526	This multiply can overflow?
7527	We never want to increase EltWidth beyond the width of the result, I think? (If width of the vector doesn't fit into the return value of cttz.elts, is the result poison, or something else?)
7543	Is there some reason to use SMAX instead of UMAX? It seems to complicate reasoning about the sign bit.
llvm/lib/Passes/PassBuilderPipelines.cpp
623	Did you really mean to remove LoopIdiomRecognize and replace it with a second run of IndVarSimplify? I'm not sure why this patch requires messing with the default pass pipeline.

Harbormaster completed remote builds in B253521: Diff 551549. · Aug 18 2023, 11:44 AM

ktkachov added a subscriber: ktkachov. · Aug 22 2023, 1:42 AM

rui.zhang added a subscriber: rui.zhang. · Aug 22 2023, 6:46 AM

kmclaughlin mentioned this in D159283: Add intrinsic to count trailing zero elements in a vector. · Aug 31 2023, 7:24 AM

kmclaughlin marked an inline comment as done. · Aug 31 2023, 7:49 AM

kmclaughlin added a subscriber: eli.friedman.

kmclaughlin added inline comments.

llvm/include/llvm/IR/Intrinsics.td
2184	This was something we considered, but we wanted to add an intrinsic which mirrors the behaviour of the existing cttz intrinsic. In D159283 I've added a second operand to indicate whether the result is poison if the first argument is all zero, similar to cttz.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7506	Thank you @craig.topper, this has been addressed in D159283.
7520	Hi @eli.friedman, I believe it is safe to assume that the upper bound will fit into an i64. Calls to the vscale intrinsic are often 64 bits when generated by the vectoriser, and the getVScaleRangeMin/getVScaleRangeMax functions themselves are returning unsigned types.
7526	I've tried to address this on D159283 by calculating the smallest possible type using the `umul_sat` operation of ConstantRange.
7543	There was no reason for choosing SMAX, I have updated this to use UMAX instead.
llvm/lib/Passes/PassBuilderPipelines.cpp
623	Removing the LoopIdiomRecognize is a mistake, I only intended to move the IndVarSimplify pass after `invokeLateLoopOptimizationsEPCallbacks` so that the new pass runs as close to LoopIdiomRecognize as possible.
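The element-width choice discussed in the comments on lines 7520 and 7526 can be modelled in plain C++. This is only a sketch under assumptions: the helper names are hypothetical, the ConstantRange is reduced to its unsigned maximum, and the saturating multiply stands in for `umul_sat`. It follows the same three steps as the patch: clamp to the result width, round up to a power of two, and use at least 8 bits.

```cpp
#include <algorithm>
#include <cstdint>

// Number of bits needed to represent V (0 for V == 0).
unsigned activeBits(uint64_t V) {
  unsigned N = 0;
  while (V) {
    ++N;
    V >>= 1;
  }
  return N;
}

// Smallest power of two that is >= V.
unsigned bitCeil(unsigned V) {
  unsigned P = 1;
  while (P < V)
    P <<= 1;
  return P;
}

// Unsigned multiply that saturates at UINT64_MAX instead of wrapping.
uint64_t umulSat(uint64_t A, uint64_t B) {
  if (A != 0 && B > UINT64_MAX / A)
    return UINT64_MAX;
  return A * B;
}

// Hypothetical model: pick the lane width used for the expansion.
// MaxVScale is 1 for fixed-length vectors.
unsigned chooseEltWidth(unsigned ResultBits, uint64_t MinElts,
                        uint64_t MaxVScale) {
  uint64_t MaxElts = umulSat(MinElts, MaxVScale);
  unsigned Width = std::min(ResultBits, activeBits(MaxElts));
  return std::max(bitCeil(Width), 8u);
}
```

For example, 16 minimum elements with a maximum vscale of 16 gives at most 256 elements, which needs 9 bits and therefore a 16-bit lane. Because the multiply saturates rather than wraps, an implausibly large vscale bound can only push the width up to the result width, never produce a lane that is too narrow.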

craig.topper added inline comments. · Sep 1 2023, 8:35 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
283	I think you want m_Specific(Index) instead of m_Instruction. m_Instruction will match any instruction and overwrite Index
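The difference between a binding matcher and a specific-value matcher can be shown with a toy model. The types below are hypothetical stand-ins and are not LLVM's PatternMatch API; they only mimic the behaviour described in the comment, where a binding matcher accepts any node and overwrites the captured variable, while a specific matcher accepts only the node it was given.

```cpp
// Toy stand-in for an IR node; not an LLVM type.
struct Node {
  int Id;
};

// Behaves like a binding matcher: matches anything and overwrites Slot.
struct BindAny {
  const Node *&Slot;
  bool match(const Node *N) {
    Slot = N;
    return true;
  }
};

// Behaves like a specific-value matcher: matches only the expected node.
struct Specific {
  const Node *Want;
  bool match(const Node *N) const { return N == Want; }
};
```

With BindAny, a pattern that was meant to check "this operand is the Index found earlier" silently succeeds on an unrelated node and replaces Index, which is the bug the comment points out.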

craig.topper added inline comments. · Sep 1 2023, 9:10 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
160	Do we need to check `skipLoop` for opt-bisect-limit?
207	Doesn't being a "preheader" guarantee it's not conditional?
307	This doesn't guarantee the loads are loading i8. The loads have their own type and don't have to match the GEP result type.
322	m_Instruction -> m_Specific
327	Isn't IdxA, zext(Index)? So Index must dominate IdxA.

As @kmclaughlin mentioned on D159283 she will be away for a few weeks. However, in the meantime I would like to address some of the comments on this patch related specifically to bug fixes and also update the patch to use the latest version of the intrinsic in D159283. Unfortunately, the only way I can do this is to commandeer the patch temporarily!

david-arm edited the summary of this revision. · Sep 7 2023, 5:21 AM

Fixed some bugs found by @craig.topper when recognising the byte mismatch idiom. This also required updating one of the tests in Transforms/LoopIdiom/AArch64/byte-compare-index.ll that was using the wrong index for comparison.
Reinstated the generic loop idiom recognise pass.
Add new patterns to ensure we use incp instead of cntp+add.

Harbormaster completed remote builds in B256787: Diff 556134. · Sep 7 2023, 5:32 AM

david-arm marked 3 inline comments as done. · Sep 7 2023, 5:32 AM

craig.topper added inline comments. · Sep 7 2023, 5:22 PM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
275	why is this needed?
289	Do we know for sure that WhileBB is the block in the loop? Could EndBB above be the backedge?
295	Do we need to check that TrueBB is the header?
328	The IdxA != IdxB check is identical to the previous if
333	`IdxA` is a zero extend of `Index` according to the previous if, so doesn't Index always dominate IdxA?

Hi @craig.topper, thanks again for the review comments! I'll take a look at your comments regarding blocks being in the loop and see if there is a problem or not. It's possible that the canonical form of a loop allows us to make certain assumptions, but I'll double check.

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
275	There is no fundamental reason why the checks are needed, but they made the vector implementation of the mismatch algorithm simpler since we didn't have to worry about poison during signed or unsigned overflow. For the cases we were interested in (unsigned 32-bit addition in C) there were no nsw or nuw flags, so we thought for now we'd restrict it to just these cases. It probably makes sense to relax this restriction in future, but it will require carefully rewriting the vectorised implementation to be safe with regard to poison/overflow, and ensuring there are no performance regressions for the loops we care about.

Added more checks that the icmp predicates are correct (EQ) and ensure that the true/false block ordering for the branches are what we expect.
Added more negative test cases for bad icmps, bad branches and bad load types.

Harbormaster completed remote builds in B256854: Diff 556252. · Sep 8 2023, 6:41 AM

Again, I've only addressed bug fixes in this new update - I'll let @kmclaughlin deal with any other comments once she is back!

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
295	Both this and the above comment about WhileBB are excellent spots. I've fixed these now - thanks @craig.topper. :)

craig.topper added inline comments. · Sep 8 2023, 10:28 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
275	Isn’t it always safe to drop the flags if needed?

SjoerdMeijer added a subscriber: SjoerdMeijer. · Sep 13 2023, 1:37 AM

craig.topper added inline comments. · Sep 18 2023, 11:35 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
356	This mentions "call" and "callsite", but there was no call involved in the original code.
378	Why do we only update DT for this block and not the others?
568	Why do we need a phi if the incoming values are the same?

craig.topper added inline comments. · Sep 19 2023, 10:57 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
516	Do we need to check for inBounds on the original GEPs before we can set it here?
521	Do we need to check for inBounds on the original GEPs before we can set it here?

craig.topper added inline comments. · Sep 19 2023, 11:02 AM

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
160	Maybe skipLoop is handled directly by the new pass manager? I'm too used to old pass manager.

Only mark the new GEPs as 'inbounds' if the original GEPs were too.
Update the dominator tree for all newly inserted blocks.
Remove pointless PHI in scalar loop preheader block.

Herald added a subscriber: fedor.sergeev. · Sep 22 2023, 1:26 AM

Thanks @craig.topper for spotting the bugs with the dominator tree and setting GEPs inbound. I've fixed those, plus removed the redundant PHI from the scalar loop preheader. I still see the same performance improvements for the loops we care about. I realise I haven't addressed all of your comments - we will try to address them later!

Harbormaster completed remote builds in B257528: Diff 557230. · Sep 22 2023, 2:23 AM

Does the new pass need to check that SVE is enabled before doing the transform?

kmclaughlin mentioned this in rG3b786f2c7608: [AArch64] Add intrinsic to count trailing zero elements. · Oct 31 2023, 3:48 AM

Rebased the patch to reduce the diff.
Added checks so that we only attempt the transformation if the target supports scalable vectors and we know the minimum page size.
Renamed the class to AArch64LoopIdiomTransform to better reflect what the pass is doing, i.e. transforming an idiom from one form to another.
Removed some of the pipeline changes that are no longer necessary.
Added a new RUN line to byte-compare-index.ll to show that in the absence of SVE we don't do the transform.

In D158291#4652947, @craig.topper wrote:

Does the new pass need to check that SVE is enabled before doing the transform?

Hi @craig.topper, good point! I've added a check that the target supports scalable vectors and that we know the minimum page size. Although the pass currently lives in lib/Target/AArch64 it is generic enough that it could be moved into a common directory and used by other targets.

It might make sense to move this patch to GitHub soon for a full review, even though I prefer Phabricator. :)

Harbormaster completed remote builds in B258061: Diff 558082. · Nov 13 2023, 11:05 AM

Address more review comments

david-arm marked 7 inline comments as done. · Nov 14 2023, 6:01 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
160	For the legacy pass manager we do!
207	You're right. I've simplified the logic here to assume a canonical form, particularly since we rejected loops without preheaders in AArch64LoopIdiomRecognize::run

Harbormaster completed remote builds in B258071: Diff 558096. · Nov 14 2023, 7:03 AM

GitHub <noreply@github.com> mentioned this in rGc7148467fc08: [AArch64] Add an AArch64 pass for loop idiom transformations (#72273). · Mon, Jan 15, 1:22 PM

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	4 lines
llvm/include/llvm/IR/Intrinsics.td	5 lines
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	58 lines
llvm/lib/Passes/PassBuilderPipelines.cpp	5 lines
llvm/lib/Target/AArch64/AArch64.h	1 line
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	18 lines
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.h	25 lines
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp	701 lines
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td	16 lines
llvm/lib/Target/AArch64/AArch64TargetMachine.h	3 lines
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp	10 lines
llvm/lib/Target/AArch64/CMakeLists.txt	1 line
llvm/test/CodeGen/AArch64/intrinsic-cttz-elts.ll	309 lines
llvm/test/Other/new-pm-defaults.ll	2 lines
llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll	1027 lines
llvm/test/Transforms/PhaseOrdering/ARM/arm_mean_q7.ll	14 lines
llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn	1 line

Diff 556134

llvm/include/llvm/CodeGen/TargetLowering.h

@@ 459 lines not shown @@ virtual bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const {
     return true;
   }

   virtual bool shouldExpandGetVectorLength(EVT CountVT, unsigned VF,
                                            bool IsScalable) const {
     return true;
   }

+  /// Return true if the @llvm.experimental.cttz.elts intrinsic should be
+  /// expanded using generic code in SelectionDAGBuilder.
+  virtual bool shouldExpandCttzElements(EVT VT) const { return true; }
+
   // Return true if op(vecreduce(x), vecreduce(y)) should be reassociated to
   // vecreduce(op(x, y)) for the reduction opcode RedOpc.
   virtual bool shouldReassociateReduction(unsigned RedOpc, EVT VT) const {
     return true;
   }

   /// Return true if it is profitable to convert a select of FP constants into
   /// a constant pool load whose address depends on the select condition. The
@@ 4,890 lines not shown @@

llvm/include/llvm/IR/Intrinsics.td

@@ 2,175 lines not shown @@ def int_experimental_vp_splice:
     DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                           [LLVMMatchType<0>,
                            LLVMMatchType<0>,
                            llvm_i32_ty,
                            LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
                            llvm_i32_ty, llvm_i32_ty],
                           [IntrNoMem, ImmArg<ArgIndex<2>>]>;

 def int_vp_is_fpclass:
     DefaultAttrsIntrinsic<[ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
                           [ llvm_anyvector_ty,
                             llvm_i32_ty,
                             LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
                             llvm_i32_ty],
                           [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<1>>]>;

+def int_experimental_cttz_elts:
+    DefaultAttrsIntrinsic<[llvm_anyint_ty],
+                          [llvm_anyvector_ty, llvm_i32_ty],
+                          [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
+
 //===-------------------------- Masked Intrinsics -------------------------===//
 //
 def int_masked_load:
     DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                           [llvm_anyptr_ty, llvm_i32_ty,
                            LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
                           [IntrReadMem, IntrArgMemOnly, IntrWillReturn, ImmArg<ArgIndex<1>>,
                            NoCapture<ArgIndex<0>>]>;
@@ 365 lines not shown @@

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

@@ 7,484 lines not shown @@ case Intrinsic::experimental_get_vector_length: {

     SDValue UMin = DAG.getNode(ISD::UMIN, sdl, CountVT, Count, MaxEVL);
     // Clip to the result type if needed.
     SDValue Trunc = DAG.getNode(ISD::TRUNCATE, sdl, VT, UMin);

     setValue(&I, Trunc);
     return;
   }
+  case Intrinsic::experimental_cttz_elts: {
+    auto DL = getCurSDLoc();
+    SDValue Op = getValue(I.getOperand(0));
+    EVT OpVT = Op.getValueType();
+
+    if (!TLI.shouldExpandCttzElements(OpVT)) {
+      visitTargetIntrinsic(I, Intrinsic);
+      return;
+    }
+
+    if (OpVT.getScalarType() != MVT::i1) {
+      // Compare the input vector elements to zero & use to count trailing zeros
+      SDValue AllZero = DAG.getConstant(0, DL, OpVT);
+      OpVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
+                              OpVT.getVectorElementCount());
+      Op = DAG.getSetCC(DL, OpVT, Op, AllZero, ISD::SETNE);
+    }
+
+    // Find the smallest "sensible" element type to use for the expansion.
+    ConstantRange CR(
+        APInt(64, OpVT.getVectorElementCount().getKnownMinValue()));
+    if (OpVT.isScalableVT())
+      CR = CR.umul_sat(getVScaleRange(I.getCaller(), 64));
+
+    unsigned EltWidth = I.getType()->getScalarSizeInBits();
+    EltWidth = std::min(EltWidth, (unsigned)CR.getActiveBits());
+    EltWidth = std::max(llvm::bit_ceil(EltWidth), (unsigned)8);
+
+    MVT NewEltTy = MVT::getIntegerVT(EltWidth);
+
+    // Create the new vector type & get the vector length
+    EVT NewVT = EVT::getVectorVT(*DAG.getContext(), NewEltTy,
+                                 OpVT.getVectorElementCount());
+
+    SDValue VL =
+        DAG.getElementCount(DL, NewEltTy, OpVT.getVectorElementCount());
+
+    SDValue StepVec = DAG.getStepVector(DL, NewVT);
+    SDValue SplatVL = DAG.getSplat(NewVT, DL, VL);
+    SDValue StepVL = DAG.getNode(ISD::SUB, DL, NewVT, SplatVL, StepVec);
+    SDValue Ext = DAG.getNode(ISD::SIGN_EXTEND, DL, NewVT, Op);
+    SDValue And = DAG.getNode(ISD::AND, DL, NewVT, StepVL, Ext);
+    SDValue Max = DAG.getNode(ISD::VECREDUCE_UMAX, DL, NewEltTy, And);
+    SDValue Sub = DAG.getNode(ISD::SUB, DL, NewEltTy, VL, Max);
+
+    // If the result is VL, then the input was all zero. Return UNDEF in this
+    // case if zero-is-poison is set.
+    if (cast<ConstantSDNode>(getValue(I.getOperand(1)))->getZExtValue() != 0) {
+      Sub =
+          DAG.getSelectCC(DL, Sub, VL, DAG.getUNDEF(NewEltTy), Sub, ISD::SETEQ);
+    }
+
+    EVT RetTy = TLI.getValueType(DAG.getDataLayout(), I.getType());
+    SDValue Ret = DAG.getNode(ISD::ZERO_EXTEND, DL, RetTy, Sub);
+
+    setValue(&I, Ret);
+    return;
+  }
   case Intrinsic::vector_insert: {
     SDValue Vec = getValue(I.getOperand(0));
     SDValue SubVec = getValue(I.getOperand(1));
     SDValue Index = getValue(I.getOperand(2));

     // The intrinsic's index type is i64, but the SDNode requires an index type
     // suitable for the target. Convert the index as required.
     MVT VectorIdxTy = TLI.getVectorIdxTy(DAG.getDataLayout());
@@ 4,532 lines not shown @@

llvm/lib/Passes/PassBuilderPipelines.cpp

@@ 430 lines not shown @@ PassBuilder::buildO1FunctionSimplificationPipeline(OptimizationLevel Level,
   // TODO: Investigate promotion cap for O1.
   LPM1.addPass(LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap,
                         /*AllowSpeculation=*/true));
   LPM1.addPass(SimpleLoopUnswitchPass());
   if (EnableLoopFlatten)
     LPM1.addPass(LoopFlattenPass());

   LPM2.addPass(LoopIdiomRecognizePass());
-  LPM2.addPass(IndVarSimplifyPass());

   invokeLateLoopOptimizationsEPCallbacks(LPM2, Level);

+  LPM2.addPass(IndVarSimplifyPass());
+
   LPM2.addPass(LoopDeletionPass());

   if (EnableLoopInterchange)
     LPM2.addPass(LoopInterchangePass());

   // Do not enable unrolling in PreLinkThinLTO phase during sample PGO
   // because it changes IR to makes profile annotation in back compile
   // inaccurate. The normal unroller doesn't pay attention to forced full unroll
@@ 163 lines not shown @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
   if (EnableLoopFlatten)
     LPM1.addPass(LoopFlattenPass());

   LPM2.addPass(LoopIdiomRecognizePass());
   LPM2.addPass(IndVarSimplifyPass());

   invokeLateLoopOptimizationsEPCallbacks(LPM2, Level);

+  LPM2.addPass(IndVarSimplifyPass());
+
   LPM2.addPass(LoopDeletionPass());

   if (EnableLoopInterchange)
     LPM2.addPass(LoopInterchangePass());

   // Do not enable unrolling in PreLinkThinLTO phase during sample PGO
   // because it changes IR to makes profile annotation in back compile
   // inaccurate. The normal unroller doesn't pay attention to forced full unroll
@@ 1,429 lines not shown @@

llvm/lib/Target/AArch64/AArch64.h

@@ 80 lines not shown @@
 void initializeAArch64CondBrTuningPass(PassRegistry &);
 void initializeAArch64ConditionOptimizerPass(PassRegistry&);
 void initializeAArch64ConditionalComparesPass(PassRegistry &);
 void initializeAArch64DAGToDAGISelPass(PassRegistry &);
 void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);
 void initializeAArch64ExpandPseudoPass(PassRegistry &);
 void initializeAArch64GlobalsTaggingPass(PassRegistry &);
 void initializeAArch64LoadStoreOptPass(PassRegistry&);
+void initializeAArch64LoopIdiomRecognizeLegacyPassPass(PassRegistry &);
 void initializeAArch64LowerHomogeneousPrologEpilogPass(PassRegistry &);
 void initializeAArch64MIPeepholeOptPass(PassRegistry &);
 void initializeAArch64O0PreLegalizerCombinerPass(PassRegistry &);
 void initializeAArch64PostLegalizerCombinerPass(PassRegistry &);
 void initializeAArch64PostLegalizerLoweringPass(PassRegistry &);
 void initializeAArch64PostSelectOptimizePass(PassRegistry &);
 void initializeAArch64PreLegalizerCombinerPass(PassRegistry &);
 void initializeAArch64PromoteConstantPass(PassRegistry&);
@@ 15 lines not shown @@

llvm/lib/Target/AArch64/AArch64ISelLowering.h

@@ 914 lines not shown @@ public:
   bool isAllActivePredicate(SelectionDAG &DAG, SDValue N) const;
   EVT getPromotedVTForPredicate(EVT VT) const;

   EVT getAsmOperandValueType(const DataLayout &DL, Type *Ty,
                              bool AllowUnknown = false) const override;

   bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const override;

+  bool shouldExpandCttzElements(EVT VT) const override;
+
   /// If a change in streaming mode is required on entry to/return from a
   /// function call it emits and returns the corresponding SMSTART or SMSTOP node.
   /// \p Entry tells whether this is before/after the Call, which is necessary
   /// because PSTATE.SM is only queried once.
   SDValue changeStreamingMode(SelectionDAG &DAG, SDLoc DL, bool Enable,
                               SDValue Chain, SDValue InGlue,
                               SDValue PStateSM, bool Entry) const;

@@ 322 lines not shown @@

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 1,759 Lines • ▼ Show 20 Lines	bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,

// The whilelo instruction only works with i32 or i64 scalar inputs.		// The whilelo instruction only works with i32 or i64 scalar inputs.
if (OpVT != MVT::i32 && OpVT != MVT::i64)		if (OpVT != MVT::i32 && OpVT != MVT::i64)
return true;		return true;

return false;		return false;
}		}

		bool AArch64TargetLowering::shouldExpandCttzElements(EVT VT) const {
		if (!Subtarget->hasSVE() \|\| VT != MVT::nxv16i1)
		return true;

		return false;
		}

void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT,		void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT,
bool StreamingSVE) {		bool StreamingSVE) {
assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");		assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

// By default everything must be expanded.		// By default everything must be expanded.
for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)		for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)
setOperationAction(Op, VT, Expand);		setOperationAction(Op, VT, Expand);

return DAG.getNode(Opcode, dl, Op.getValueType(), Op.getOperand(1),
Op.getOperand(2), Op.getOperand(3));		Op.getOperand(2), Op.getOperand(3));
}		}
case Intrinsic::get_active_lane_mask: {		case Intrinsic::get_active_lane_mask: {
SDValue ID =		SDValue ID =
DAG.getTargetConstant(Intrinsic::aarch64_sve_whilelo, dl, MVT::i64);		DAG.getTargetConstant(Intrinsic::aarch64_sve_whilelo, dl, MVT::i64);
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(), ID,		return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(), ID,
Op.getOperand(1), Op.getOperand(2));		Op.getOperand(1), Op.getOperand(2));
}		}
		case Intrinsic::experimental_cttz_elts: {
		EVT Ty = Op.getValueType();
		if (Ty == MVT::i64)
		return Op;

		SDValue NewCttzElts =
		DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::i64, Op.getOperand(0),
		Op.getOperand(1), Op.getOperand(2));

		return DAG.getZExtOrTrunc(NewCttzElts, dl, Ty);
		}
}		}
}		}

bool AArch64TargetLowering::shouldExtendGSIndex(EVT VT, EVT &EltTy) const {		bool AArch64TargetLowering::shouldExtendGSIndex(EVT VT, EVT &EltTy) const {
if (VT.getVectorElementType() == MVT::i8 ||		if (VT.getVectorElementType() == MVT::i8 ||
VT.getVectorElementType() == MVT::i16) {		VT.getVectorElementType() == MVT::i16) {
EltTy = MVT::i32;		EltTy = MVT::i32;
return true;		return true;

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.h

This file was added.

				//===- AArch64LoopIdiomRecognize.h --------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_AARCH64_AARCH64LOOPIDIOMRECOGNIZE_H
				#define LLVM_LIB_TARGET_AARCH64_AARCH64LOOPIDIOMRECOGNIZE_H

				#include "llvm/IR/PassManager.h"
				#include "llvm/Transforms/Scalar/LoopPassManager.h"

				namespace llvm {

				struct AArch64LoopIdiomRecognizePass
				: PassInfoMixin<AArch64LoopIdiomRecognizePass> {
				PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,
				LoopStandardAnalysisResults &AR, LPMUpdater &U);
				};

				} // namespace llvm

				#endif // LLVM_LIB_TARGET_AARCH64_AARCH64LOOPIDIOMRECOGNIZE_H

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp

This file was added.


				//===- AArch64LoopIdiomRecognize.cpp - Loop idiom recognition -------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "AArch64LoopIdiomRecognize.h"
				#include "llvm/Analysis/DomTreeUpdater.h"
				#include "llvm/Analysis/LoopPass.h"
				#include "llvm/Analysis/TargetLibraryInfo.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/IntrinsicsAArch64.h"
				#include "llvm/IR/MDBuilder.h"
				#include "llvm/IR/PatternMatch.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"

				using namespace llvm;

				#define DEBUG_TYPE "aarch64-lir"

				static cl::opt<bool>
				DisableAll("disable-aarch64-lir-all", cl::Hidden, cl::init(false),
				cl::desc("Disable AArch64 Loop Idiom Recognize Pass."));

				static cl::opt<bool> DisableByteCmp(
				"disable-aarch64-lir-bytecmp", cl::Hidden, cl::init(false),
				cl::desc("Proceed with AArch64 Loop Idiom Recognize Pass, but do "
				"not convert byte-compare loop(s)."));

				namespace llvm {

				void initializeAArch64LoopIdiomRecognizeLegacyPassPass(PassRegistry &);
				Pass *createAArch64LoopIdiomPass();

				} // end namespace llvm

				namespace {

				class AArch64LoopIdiomRecognize {
				Loop *CurLoop = nullptr;
				DominatorTree *DT;
				LoopInfo *LI;
				TargetLibraryInfo *TLI;
				const TargetTransformInfo *TTI;
				const DataLayout *DL;

				public:
				explicit AArch64LoopIdiomRecognize(DominatorTree *DT, LoopInfo *LI,
				TargetLibraryInfo *TLI,
				const TargetTransformInfo *TTI,
				const DataLayout *DL)
				: DT(DT), LI(LI), TLI(TLI), TTI(TTI), DL(DL) {}

				bool run(Loop *L);

				private:
				/// \name Countable Loop Idiom Handling
				/// @{

				bool runOnCountableLoop();
				bool runOnLoopBlock(BasicBlock *BB, const SCEV *BECount,
				SmallVectorImpl<BasicBlock *> &ExitBlocks);

				bool recognizeByteCompare();
				Value *expandFindMismatch(IRBuilder<> &Builder, Value *PtrA, Value *PtrB,
				Value *Start, Value *MaxLen);
				void transformByteCompare(Value *PtrA, Value *PtrB, Value *MaxLen,
				Value *Index, Value *Start, bool IncIdx,
				BasicBlock *FoundBB, BasicBlock *EndBB);

				/// @}
				};

				class AArch64LoopIdiomRecognizeLegacyPass : public LoopPass {
				public:
				static char ID;

				explicit AArch64LoopIdiomRecognizeLegacyPass() : LoopPass(ID) {
				initializeAArch64LoopIdiomRecognizeLegacyPassPass(
				*PassRegistry::getPassRegistry());
				}

				StringRef getPassName() const override {
				return "Recognize AArch64-specific loop idioms";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.addRequired<TargetLibraryInfoWrapperPass>();
				AU.addPreserved<TargetLibraryInfoWrapperPass>();
				AU.addRequired<TargetTransformInfoWrapperPass>();
				}

				bool runOnLoop(Loop *L, LPPassManager &LPM) override;
				};

				bool AArch64LoopIdiomRecognizeLegacyPass::runOnLoop(Loop *L,
				LPPassManager &LPM) {

				auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
				auto *TLI = &getAnalysis<TargetLibraryInfoWrapperPass>().getTLI(
				*L->getHeader()->getParent());
				auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(
				*L->getHeader()->getParent());
				return AArch64LoopIdiomRecognize(
				DT, LI, TLI, &TTI, &L->getHeader()->getModule()->getDataLayout())
				.run(L);
				}

				} // end anonymous namespace

				char AArch64LoopIdiomRecognizeLegacyPass::ID = 0;

				INITIALIZE_PASS_BEGIN(AArch64LoopIdiomRecognizeLegacyPass, "aarch64-lir",
				"Recognize AArch64-specific loop idioms", false, false)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
				INITIALIZE_PASS_DEPENDENCY(LCSSAWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(TargetLibraryInfoWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
				INITIALIZE_PASS_END(AArch64LoopIdiomRecognizeLegacyPass, "aarch64-lir",
				"Recognize AArch64-specific loop idioms", false, false)

				Pass *llvm::createAArch64LoopIdiomPass() {
				return new AArch64LoopIdiomRecognizeLegacyPass();
				}

				PreservedAnalyses
				AArch64LoopIdiomRecognizePass::run(Loop &L, LoopAnalysisManager &AM,
				LoopStandardAnalysisResults &AR,
				LPMUpdater &) {
				if (DisableAll)
				return PreservedAnalyses::all();

				const auto *DL = &L.getHeader()->getModule()->getDataLayout();

				AArch64LoopIdiomRecognize LIR(&AR.DT, &AR.LI, &AR.TLI, &AR.TTI, DL);
				if (!LIR.run(&L))
				return PreservedAnalyses::all();

				return PreservedAnalyses::none();
				}

				//===----------------------------------------------------------------------===//
				//
				// Implementation of AArch64LoopIdiomRecognize
				//
				//===----------------------------------------------------------------------===//

				bool AArch64LoopIdiomRecognize::run(Loop *L) {
				CurLoop = L;
				craig.topper (Done): Do we need to check `skipLoop` for opt-bisect-limit?
				craig.topper (Done): Maybe skipLoop is handled directly by the new pass manager? I'm too used to old pass manager.
				david-arm (Author, Done): For the legacy pass manager we do!

				if (DisableAll)
				return false;

				// If the loop could not be converted to canonical form, it must have an
				// indirectbr in it, just give up.
				if (!L->getLoopPreheader())
				return false;

				LLVM_DEBUG(dbgs() << DEBUG_TYPE " Scanning: F["
				<< CurLoop->getHeader()->getParent()->getName()
				<< "] Loop %" << CurLoop->getHeader()->getName() << "\n");

				return recognizeByteCompare();
				}

				/// Match loop-invariant value.
				template <typename SubPattern_t> struct match_LoopInvariant {
				SubPattern_t SubPattern;
				const Loop *L;

				match_LoopInvariant(const SubPattern_t &SP, const Loop *L)
				: SubPattern(SP), L(L) {}

				template <typename ITy> bool match(ITy *V) {
				return L->isLoopInvariant(V) && SubPattern.match(V);
				}
				};

				/// Matches if the value is loop-invariant.
				template <typename Ty>
				inline match_LoopInvariant<Ty> m_LoopInvariant(const Ty &M, const Loop *L) {
				return match_LoopInvariant<Ty>(M, L);
				}
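
Editor's sketch: the `match_LoopInvariant` wrapper above composes an invariance test with an arbitrary sub-pattern. The standalone C++ below illustrates that composition outside LLVM. `ToyValue`, `ToyLoop`, `MatchConstant` and `mLoopInvariant` are hypothetical names invented for this illustration; they are not part of the patch or of `llvm/IR/PatternMatch.h`.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-ins for llvm::Value and llvm::Loop, for illustration only.
struct ToyValue {
  int constant;
};

struct ToyLoop {
  // Values defined inside the loop; everything else counts as invariant.
  std::set<const ToyValue *> definedInLoop;
  bool isLoopInvariant(const ToyValue *v) const {
    return definedInLoop.count(v) == 0;
  }
};

// A leaf sub-pattern: matches a value holding one specific constant.
struct MatchConstant {
  int expected;
  bool match(const ToyValue *v) const { return v->constant == expected; }
};

// Wrapper pattern: the value must be loop-invariant AND satisfy the
// wrapped sub-pattern, mirroring the shape of match_LoopInvariant.
template <typename SubPattern> struct MatchLoopInvariant {
  SubPattern sub;
  const ToyLoop *loop;
  bool match(const ToyValue *v) const {
    return loop->isLoopInvariant(v) && sub.match(v);
  }
};

// Helper that deduces the sub-pattern type, like m_LoopInvariant.
template <typename SubPattern>
MatchLoopInvariant<SubPattern> mLoopInvariant(const SubPattern &sub,
                                              const ToyLoop *loop) {
  assert(loop && "a loop is required to test invariance");
  return MatchLoopInvariant<SubPattern>{sub, loop};
}
```

Because the invariance test runs first, a value defined inside the loop is rejected before the sub-pattern is consulted.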

				bool AArch64LoopIdiomRecognize::recognizeByteCompare() {
				if (DisableByteCmp)
				return false;

				BasicBlock *Header = CurLoop->getHeader();
				BasicBlock *PH = CurLoop->getLoopPreheader();

				// The preheader should only contain an unconditional branch.
				if (!PH || &PH->front() != PH->getTerminator())
				return false;
				auto *EntryBI = dyn_cast<BranchInst>(PH->getTerminator());
				if (!EntryBI || EntryBI->isConditional())
				craig.topper (Not Done): Doesn't being a "preheader" guarantee it's not conditional?
				david-arm (Author, Done): You're right. I've simplified the logic here to assume a canonical form, particularly since we rejected loops without preheaders in AArch64LoopIdiomRecognize::run
				return false;

				if (CurLoop->getNumBackEdges() != 1 || CurLoop->getNumBlocks() != 2)
				return false;

				PHINode *PN = dyn_cast<PHINode>(&Header->front());
				if (!PN || PN->getNumIncomingValues() != 2)
				return false;

				auto LoopBlocks = CurLoop->getBlocks();
				// The first block in the loop should contain only 4 instructions, e.g.
				//
				// while.cond:
				// %res.phi = phi i32 [ %start, %ph ], [ %inc, %while.body ]
				// %inc = add i32 %res.phi, 1
				// %cmp.not = icmp eq i32 %inc, %n
				// br i1 %cmp.not, label %while.end, label %while.body
				//
				auto CondBBInsts = LoopBlocks[0]->instructionsWithoutDebug();
				if (std::distance(CondBBInsts.begin(), CondBBInsts.end()) > 4)
				return false;

				// The second block should contain 7 instructions, e.g.
				//
				// while.body:
				// %idx = zext i32 %inc to i64
				// %idx.a = getelementptr inbounds i8, ptr %a, i64 %idx
				// %load.a = load i8, ptr %idx.a
				// %idx.b = getelementptr inbounds i8, ptr %b, i64 %idx
				// %load.b = load i8, ptr %idx.b
				// %cmp.not.ld = icmp eq i8 %load.a, %load.b
				// br i1 %cmp.not.ld, label %while.cond, label %while.end
				//
				auto LoopBBInsts = LoopBlocks[1]->instructionsWithoutDebug();
				if (std::distance(LoopBBInsts.begin(), LoopBBInsts.end()) > 7)
				return false;

				using namespace PatternMatch;

				// The incoming value to the PHI node from the loop should be an add of 1.
				Instruction *Index = nullptr;
				Value *StartIdx = nullptr;
				for (BasicBlock *BB : PN->blocks()) {
				if (!CurLoop->contains(BB)) {
				StartIdx = PN->getIncomingValueForBlock(BB);
				continue;
				}
				Index = dyn_cast<Instruction>(PN->getIncomingValueForBlock(BB));
				// Limit to 32-bit types for now
				if (!Index || !Index->getType()->isIntegerTy(32) ||
				!match(Index, m_c_Add(m_Specific(PN), m_One())))
				return false;
				}

				// If we match the pattern, PN and Index will be replaced with the result of
				// the cttz.elts intrinsic. If any other instructions are used outside of
				// the loop, we cannot replace it.
				for (BasicBlock *BB : LoopBlocks)
				for (Instruction &I : *BB)
				if (&I != PN && &I != Index)
				for (User *U : I.users()) {
				auto UI = dyn_cast<Instruction>(U);
				if (!CurLoop->contains(UI))
				return false;
				}

				// Don't replace the loop if the add has a wrap flag.
				if (Index->hasNoSignedWrap() || Index->hasNoUnsignedWrap())
				craig.topper (Not Done): why is this needed?
				david-arm (Author, Done): There is no fundamental reason why the checks are needed, but it made the vector implementation of the mismatch algorithm simpler since we didn't have to worry about poison during signed or unsigned overflow. For the cases we were interested in (unsigned 32-bit addition in C) there were no nsw or nuw flags so we thought for now we'd restrict it to just these cases. It probably makes sense to relax this restriction in future, but it will require carefully rewriting the vectorised implementation to be safe with regards poison/overflow, and ensuring there are no performance regressions for the loops we care about.
				craig.topper (Not Done): Isn’t it always safe to drop the flags if needed?
				return false;

				// Match the branch instruction for the header
				ICmpInst::Predicate Pred;
				Value *MaxLen;
				BasicBlock *EndBB, *WhileBB;
				if (!match(Header->getTerminator(),
				m_Br(m_ICmp(Pred, m_Specific(Index), m_Value(MaxLen)),
				craig.topper (Done): I think you want m_Specific(Index) instead of m_Instruction. m_Instruction will match any instruction and overwrite Index
				m_BasicBlock(EndBB), m_BasicBlock(WhileBB))))
				return false;

				// WhileBB should contain the pattern of load & compare instructions. Match
				// the pattern and find the GEP instructions used by the loads.
				ICmpInst::Predicate WhilePred;
				craig.topper (Done): Do we know for sure that WhileBB is the block in the loop? Could EndBB above be the backedge?
				BasicBlock *FoundBB;
				BasicBlock *TrueBB;
				Value *LoadA, *LoadB;
				if (!match(WhileBB->getTerminator(),
				m_Br(m_ICmp(WhilePred, m_Value(LoadA), m_Value(LoadB)),
				m_BasicBlock(TrueBB), m_BasicBlock(FoundBB))))
				craig.topper (Done): Do we need to check that TrueBB is the header?
				david-arm (Author, Done): Both this and the above comment about WhileBB are excellent spots. I've fixed these now - thanks @craig.topper. :)
				return false;

				Value *A, *B;
				if (!match(LoadA, m_Load(m_Value(A))) || !match(LoadB, m_Load(m_Value(B))))
				return false;

				GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(A);
				GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(B);

				if (!GEPA || !GEPB)
				return false;

				craig.topper (Done): This doesn't guarantee the loads are loading i8. The load have their own type and don't have to match the GEP result type.
				Value *PtrA = GEPA->getPointerOperand();
				Value *PtrB = GEPB->getPointerOperand();

				// Check we are loading i8 values from two loop invariant pointers
				if (!CurLoop->isLoopInvariant(PtrA) || !CurLoop->isLoopInvariant(PtrB) ||
				!GEPA->getResultElementType()->isIntegerTy(8) ||
				!GEPB->getResultElementType()->isIntegerTy(8) ||
				!cast<LoadInst>(LoadA)->getType()->isIntegerTy(8) ||
				!cast<LoadInst>(LoadB)->getType()->isIntegerTy(8) || PtrA == PtrB)
				return false;

				// Check that the index to the GEPs is the index we found earlier
				if (GEPA->getNumIndices() > 1 || GEPB->getNumIndices() > 1)
				return false;

				craig.topper (Done): m_Instruction -> m_Specific
				Value *IdxA = GEPA->getOperand(GEPA->getNumIndices());
				Value *IdxB = GEPB->getOperand(GEPB->getNumIndices());
				if (IdxA != IdxB)
				return false;

				craig.topper (Done): Isn't IdxA, zext(Index)? So Index must dominate IdxA.
				if (IdxA != IdxB || !match(IdxA, m_ZExt(m_Specific(Index))))
				craig.topper (Done): The IdxA != IdxB check is identical to the previous if
				return false;

				// If the index is incremented before the GEP/Load pair, we need to
				// add 1 to the start value.
				bool IncIdx = DT->dominates(Index, cast<Instruction>(IdxA));
				craig.topper (Done): `IdxA` is a zero extend of `Index` according to the previous if, so doesn't Index always dominate IdxA?

				LLVM_DEBUG(dbgs() << "FOUND IDIOM IN LOOP: \n"
				<< *(EndBB->getParent()) << "\n\n");
				transformByteCompare(PtrA, PtrB, MaxLen, Index, StartIdx, IncIdx, FoundBB,
				EndBB);
				return true;
				}
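
Editor's sketch: read as plain C++, the two-block loop shape that `recognizeByteCompare` matches (see the `while.cond` / `while.body` IR in its comments) behaves roughly like the function below. This is a reference model written for this page, not code from the patch. The name `byteCompareIdiom` and the `start < maxLen` precondition are assumptions made to keep the model free of 32-bit wrap-around.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of the matched loop. The index is incremented
// BEFORE the loads, so a[start] and b[start] are never compared.
std::uint32_t byteCompareIdiom(const std::uint8_t *a, const std::uint8_t *b,
                               std::uint32_t start, std::uint32_t maxLen) {
  assert(start < maxLen && "model assumes the bound is reached without wrapping");
  std::uint32_t i = start;
  for (;;) {
    ++i;               // %inc = add i32 %res.phi, 1
    if (i == maxLen)   // while.cond: br i1 %cmp.not, %while.end, %while.body
      return i;
    if (a[i] != b[i])  // while.body: br i1 %cmp.not.ld, %while.cond, %while.end
      return i;
  }
}
```

The increment-before-load ordering is what the `IncIdx` flag captures: when it is set, `transformByteCompare` adds 1 to the start value before expanding the mismatch search.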

				Value *AArch64LoopIdiomRecognize::expandFindMismatch(IRBuilder<> &Builder,
				Value *PtrA, Value *PtrB,
				Value *Start,
				Value *MaxLen) {
				// Get the arguments and types for the intrinsic.
				BasicBlock *Preheader = CurLoop->getLoopPreheader();
				BranchInst *PHBranch = cast<BranchInst>(Preheader->getTerminator());
				LLVMContext &Ctx = PHBranch->getContext();
				Type *LoadType = Type::getInt8Ty(Ctx);
				Type *ResType = Builder.getInt32Ty();

				// Split block at the original callsite, where the EndBlock continues from
				// where the original call ended.
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				BasicBlock *EndBlock =
				craig.topper (Done): mention of "call" and "callsite" here. there was no call involved in the original code.
				SplitBlock(Preheader, PHBranch, DT, LI, nullptr, "mismatch_end");

				// Create the blocks that we're going to need:
				// 1. A block for checking the zero-extended length exceeds 0
				// 2. A block to check that the start and end addresses of a given array
				// lie on the same page.
				// 3. The SVE loop preheader.
				// 4. The first SVE loop block.
				// 5. The SVE loop increment block.
				// 6. A block we can jump to from the SVE loop when a mismatch is found.
				// 7. The first block of the scalar loop itself, containing PHIs, loads
				// and cmp.
				// 8. A scalar loop increment block to increment the PHIs and go back
				// around the loop.

				BasicBlock *MinItCheckBlock = BasicBlock::Create(
				Ctx, "mismatch_min_it_check", EndBlock->getParent(), EndBlock);

				DTU.applyUpdates({{DominatorTree::Insert, Preheader, MinItCheckBlock},
				{DominatorTree::Delete, Preheader, EndBlock}});

				// Update the terminator added by SplitBlock to branch to the first block
				craig.topper (Done): Why do we only update DT for this block and not the others?
				Preheader->getTerminator()->setSuccessor(0, MinItCheckBlock);

				BasicBlock *MemCheckBlock = BasicBlock::Create(
				Ctx, "mismatch_mem_check", EndBlock->getParent(), EndBlock);

				BasicBlock *SVELoopPreheaderBlock = BasicBlock::Create(
				Ctx, "mismatch_sve_loop_preheader", EndBlock->getParent(), EndBlock);

				BasicBlock *SVELoopStartBlock = BasicBlock::Create(
				Ctx, "mismatch_sve_loop", EndBlock->getParent(), EndBlock);

				BasicBlock *SVELoopIncBlock = BasicBlock::Create(
				Ctx, "mismatch_sve_loop_inc", EndBlock->getParent(), EndBlock);

				BasicBlock *SVELoopMismatchBlock = BasicBlock::Create(
				Ctx, "mismatch_sve_loop_found", EndBlock->getParent(), EndBlock);

				BasicBlock *LoopPreHeaderBlock = BasicBlock::Create(
				Ctx, "mismatch_loop_pre", EndBlock->getParent(), EndBlock);

				BasicBlock *LoopStartBlock =
				BasicBlock::Create(Ctx, "mismatch_loop", EndBlock->getParent(), EndBlock);

				BasicBlock *LoopIncBlock = BasicBlock::Create(
				Ctx, "mismatch_loop_inc", EndBlock->getParent(), EndBlock);

				// Update LoopInfo with the new SVE & scalar loops.
				auto SVELoop = LI->AllocateLoop();
				auto ScalarLoop = LI->AllocateLoop();
				if (CurLoop->getParentLoop()) {
				CurLoop->getParentLoop()->addChildLoop(SVELoop);
				CurLoop->getParentLoop()->addChildLoop(ScalarLoop);
				} else {
				LI->addTopLevelLoop(SVELoop);
				LI->addTopLevelLoop(ScalarLoop);
				}

				// Add the new basic blocks to their associated loops.
				SVELoop->addBasicBlockToLoop(MinItCheckBlock, *LI);
				SVELoop->addBasicBlockToLoop(MemCheckBlock, *LI);
				SVELoop->addBasicBlockToLoop(SVELoopPreheaderBlock, *LI);
				SVELoop->addBasicBlockToLoop(SVELoopStartBlock, *LI);
				SVELoop->addBasicBlockToLoop(SVELoopIncBlock, *LI);
				SVELoop->addBasicBlockToLoop(SVELoopMismatchBlock, *LI);

				ScalarLoop->addBasicBlockToLoop(LoopPreHeaderBlock, *LI);
				ScalarLoop->addBasicBlockToLoop(LoopStartBlock, *LI);
				ScalarLoop->addBasicBlockToLoop(LoopIncBlock, *LI);

				// Set up some types and constants that we intend to reuse.
				Type *I64Type = Builder.getInt64Ty();

				// Check the zero-extended iteration count > 0
				Builder.SetInsertPoint(MinItCheckBlock);
				Value *ExtStart = Builder.CreateZExt(Start, I64Type);
				Value *ExtEnd = Builder.CreateZExt(MaxLen, I64Type);
				// This check doesn't really cost us very much.

				Value *LimitCheck = Builder.CreateICmpULE(Start, MaxLen);
				BranchInst *MinItCheckBr =
				BranchInst::Create(MemCheckBlock, LoopPreHeaderBlock, LimitCheck);
				MinItCheckBr->setMetadata(
				LLVMContext::MD_prof,
				MDBuilder(MinItCheckBr->getContext()).createBranchWeights(99, 1));
				Builder.Insert(MinItCheckBr);

				// For each of the arrays, check the start/end addresses are on the same
				// page.
				Builder.SetInsertPoint(MemCheckBlock);

				// For each start address calculate the offset into the min architecturally
				// allowed page size (4096). Then determine how many bytes there are left on
				// the page and see if this is >= MaxLen.
				Value *LhsStartGEP = Builder.CreateGEP(LoadType, PtrA, ExtStart);
				Value *RhsStartGEP = Builder.CreateGEP(LoadType, PtrB, ExtStart);
				Value *RhsStart = Builder.CreatePtrToInt(RhsStartGEP, I64Type);
				Value *LhsStart = Builder.CreatePtrToInt(LhsStartGEP, I64Type);
				Value *LhsEndGEP = Builder.CreateGEP(LoadType, PtrA, ExtEnd);
				Value *RhsEndGEP = Builder.CreateGEP(LoadType, PtrB, ExtEnd);
				Value *LhsEnd = Builder.CreatePtrToInt(LhsEndGEP, I64Type);
				Value *RhsEnd = Builder.CreatePtrToInt(RhsEndGEP, I64Type);
				Value *LhsStartPage = Builder.CreateLShr(LhsStart, uint64_t(12));
				Value *LhsEndPage = Builder.CreateLShr(LhsEnd, uint64_t(12));
				Value *RhsStartPage = Builder.CreateLShr(RhsStart, uint64_t(12));
				Value *RhsEndPage = Builder.CreateLShr(RhsEnd, uint64_t(12));
				Value *LhsPageCmp = Builder.CreateICmpNE(LhsStartPage, LhsEndPage);
				Value *RhsPageCmp = Builder.CreateICmpNE(RhsStartPage, RhsEndPage);

				Value *CombinedPageCmp = Builder.CreateOr(LhsPageCmp, RhsPageCmp);
				BranchInst *CombinedPageCmpCmpBr = BranchInst::Create(
				LoopPreHeaderBlock, SVELoopPreheaderBlock, CombinedPageCmp);
				CombinedPageCmpCmpBr->setMetadata(
				LLVMContext::MD_prof, MDBuilder(CombinedPageCmpCmpBr->getContext())
				.createBranchWeights(10, 90));
				Builder.Insert(CombinedPageCmpCmpBr);

				// Set up the SVE loop preheader, i.e. calculate initial loop predicate,
				// zero-extend MaxLen to 64-bits, determine the number of vector elements
				// processed in each iteration, etc.
				Builder.SetInsertPoint(SVELoopPreheaderBlock);

				// At this point we know two things must be true:
				// 1. Start <= End
				// 2. ExtMaxLen <= 4096 due to the page checks.
				// Therefore, we know that we can use a 64-bit induction variable that
				// starts from 0 -> ExtMaxLen and it will not overflow.
				ScalableVectorType *PredVTy =
				ScalableVectorType::get(Builder.getInt1Ty(), 16);

				Value *InitialPred = Builder.CreateIntrinsic(
				Intrinsic::get_active_lane_mask, {PredVTy, I64Type}, {ExtStart, ExtEnd});

				Value *VecLen = Builder.CreateIntrinsic(Intrinsic::vscale, {I64Type}, {});
				VecLen = Builder.CreateMul(VecLen, ConstantInt::get(I64Type, 16), "",
				/*HasNUW=*/true, /*HasNSW=*/true);

				Value *PFalse = Builder.CreateVectorSplat(PredVTy->getElementCount(),
				Builder.getInt1(false));

				BranchInst *JumpToSVELoop = BranchInst::Create(SVELoopStartBlock);
				Builder.Insert(JumpToSVELoop);

				// Set up the first SVE loop block by creating the PHIs, doing the vector
				// loads and comparing the vectors.
				Builder.SetInsertPoint(SVELoopStartBlock);
				PHINode *LoopPred = Builder.CreatePHI(PredVTy, 2, "mismatch_sve_loop_pred");
				LoopPred->addIncoming(InitialPred, SVELoopPreheaderBlock);
				PHINode *SVEIndexPhi = Builder.CreatePHI(I64Type, 2, "mismatch_sve_index");
				SVEIndexPhi->addIncoming(ExtStart, SVELoopPreheaderBlock);
				Type *SVELoadType = ScalableVectorType::get(Builder.getInt8Ty(), 16);
				Value *GepOffset = SVEIndexPhi;
				Value *Passthru = ConstantInt::getNullValue(SVELoadType);

				Value *SVELhsGep = Builder.CreateGEP(LoadType, PtrA, GepOffset);
				cast<GetElementPtrInst>(SVELhsGep)->setIsInBounds(true);
				Value *SVELhsLoad = Builder.CreateMaskedLoad(SVELoadType, SVELhsGep, Align(1),
				LoopPred, Passthru);

				craig.topper (Done): Do we need to check for inBounds on the original GEPs before we can set it here?
				Value *SVERhsGep = Builder.CreateGEP(LoadType, PtrB, GepOffset);
				cast<GetElementPtrInst>(SVERhsGep)->setIsInBounds(true);
				Value *SVERhsLoad = Builder.CreateMaskedLoad(SVELoadType, SVERhsGep, Align(1),
				LoopPred, Passthru);

				craig.topper (Done): Do we need to check for inBounds on the original GEPs before we can set it here?
				Value *SVEMatchCmp = Builder.CreateICmpNE(SVELhsLoad, SVERhsLoad);
				SVEMatchCmp = Builder.CreateSelect(LoopPred, SVEMatchCmp, PFalse);
				Value *SVEMatchHasActiveLanes = Builder.CreateOrReduce(SVEMatchCmp);
				BranchInst *SVEEarlyExit = BranchInst::Create(
				SVELoopMismatchBlock, SVELoopIncBlock, SVEMatchHasActiveLanes);
				Builder.Insert(SVEEarlyExit);

				// Increment the index counter and calculate the predicate for the next
				// iteration of the loop. We branch back to the start of the loop if there
				// is at least one active lane.
				Builder.SetInsertPoint(SVELoopIncBlock);
				Value *NewSVEIndexPhi = Builder.CreateAdd(SVEIndexPhi, VecLen, "",
				/*HasNUW=*/true, /*HasNSW=*/true);
				SVEIndexPhi->addIncoming(NewSVEIndexPhi, SVELoopIncBlock);
				Value *NewPred =
				Builder.CreateIntrinsic(Intrinsic::get_active_lane_mask,
				{PredVTy, I64Type}, {NewSVEIndexPhi, ExtEnd});
				LoopPred->addIncoming(NewPred, SVELoopIncBlock);

				Value *PredHasActiveLanes =
				Builder.CreateExtractElement(NewPred, uint64_t(0));
				BranchInst *SVELoopBranchBack =
				BranchInst::Create(SVELoopStartBlock, EndBlock, PredHasActiveLanes);
				Builder.Insert(SVELoopBranchBack);

				// If we found a mismatch then we need to calculate which lane in the vector
				// had a mismatch and add that on to the current loop index.
				Builder.SetInsertPoint(SVELoopMismatchBlock);
				Value *PredMatchCmp = Builder.CreateAnd(LoopPred, SVEMatchCmp);
				Value *Ctz = Builder.CreateIntrinsic(
				Intrinsic::experimental_cttz_elts, {ResType, SVEMatchCmp->getType()},
				{PredMatchCmp, /*ZeroIsPoison=*/Builder.getInt32(1)});
				Ctz = Builder.CreateZExt(Ctz, I64Type);
				Value *SVELoopRes64 = Builder.CreateAdd(SVEIndexPhi, Ctz, "",
				/*HasNUW=*/true, /*HasNSW=*/true);
				Value *SVELoopRes = Builder.CreateTrunc(SVELoopRes64, ResType);

				Builder.Insert(BranchInst::Create(EndBlock));

				// Generate code for scalar loop.
				Builder.SetInsertPoint(LoopPreHeaderBlock);
				PHINode *StartIndexPhi =
				Builder.CreatePHI(ResType, 2, "mismatch_start_index");
				StartIndexPhi->addIncoming(Start, MemCheckBlock);
				StartIndexPhi->addIncoming(Start, MinItCheckBlock);
				Builder.Insert(BranchInst::Create(LoopStartBlock));

				craig.topper (Done): Why do we need a phi if the incoming values are the same?
				Builder.SetInsertPoint(LoopStartBlock);
				PHINode *IndexPhi = Builder.CreatePHI(ResType, 2, "mismatch_index");
				IndexPhi->addIncoming(StartIndexPhi, LoopPreHeaderBlock);

				// Otherwise compare the values
				// Load bytes from each array and compare them.
				GepOffset = Builder.CreateZExt(IndexPhi, I64Type);

				Value *LhsGep = Builder.CreateGEP(LoadType, PtrA, GepOffset);
				cast<GetElementPtrInst>(LhsGep)->setIsInBounds(true);
				Value *LhsLoad = Builder.CreateLoad(LoadType, LhsGep);

				Value *RhsGep = Builder.CreateGEP(LoadType, PtrB, GepOffset);
				cast<GetElementPtrInst>(RhsGep)->setIsInBounds(true);
				Value *RhsLoad = Builder.CreateLoad(LoadType, RhsGep);

				Value *MatchCmp = Builder.CreateICmpEQ(LhsLoad, RhsLoad);
				// If we have a mismatch then exit the loop ...
				BranchInst *MatchCmpBr = BranchInst::Create(LoopIncBlock, EndBlock, MatchCmp);
				Builder.Insert(MatchCmpBr);
				// Have we reached the maximum permitted length for the loop?
				Builder.SetInsertPoint(LoopIncBlock);
				Value *PhiInc = Builder.CreateAdd(IndexPhi, ConstantInt::get(ResType, 1));
				IndexPhi->addIncoming(PhiInc, LoopIncBlock);
				Value *IVCmp = Builder.CreateICmpEQ(IndexPhi, MaxLen);
				BranchInst *IVCmpBr = BranchInst::Create(EndBlock, LoopStartBlock, IVCmp);
				Builder.Insert(IVCmpBr);

				// In the end block we need to insert a PHI node to deal with the following
				// cases:
				// 1. The length of the loop was zero, hence we jumped straight from
				// MinItCheckBlock.
				// 2. We didn't find a mismatch in the scalar loop, so we should return
				// MaxLen.
				// 3. We exited the scalar loop early due to a mismatch and need to return
				// the index that we found.
				// 4. We didn't find a mismatch in the SVE loop, so we should return
				// MaxLen.
				// 5. We exited the SVE loop early due to a mismatch and need to return
				// the index that we found.
				Builder.SetInsertPoint(EndBlock, EndBlock->getFirstInsertionPt());
				PHINode *ResPhi = Builder.CreatePHI(ResType, 4, "mismatch_result");
				ResPhi->addIncoming(MaxLen, LoopIncBlock);
				ResPhi->addIncoming(IndexPhi, LoopStartBlock);
				ResPhi->addIncoming(MaxLen, SVELoopIncBlock);
				ResPhi->addIncoming(SVELoopRes, SVELoopMismatchBlock);

				return Builder.CreateTrunc(ResPhi, ResType);
				}

				void AArch64LoopIdiomRecognize::transformByteCompare(
				    Value *PtrA, Value *PtrB, Value *MaxLen, Value *Index, Value *Start,
				    bool IncIdx, BasicBlock *FoundBB, BasicBlock *EndBB) {

				// Insert the byte compare intrinsic at the end of the preheader block
				BasicBlock *Preheader = CurLoop->getLoopPreheader();
				BasicBlock *Header = CurLoop->getHeader();
				BranchInst *PHBranch = cast<BranchInst>(Preheader->getTerminator());
				IRBuilder<> Builder(PHBranch);
				Builder.SetCurrentDebugLocation(PHBranch->getDebugLoc());

				// Increment the start index if the increment was done before the loads in
				// the loop.
				if (IncIdx)
				  Start = Builder.CreateAdd(Start, ConstantInt::get(Start->getType(), 1));

				Value *ByteCmpRes = expandFindMismatch(Builder, PtrA, PtrB, Start, MaxLen);

				// Replace uses of the index and induction Phi with the intrinsic (we already
				// checked that the first instruction of Header is the Phi above).
				auto IndPhi = &Header->front();
				IndPhi->replaceAllUsesWith(ByteCmpRes);
				Index->replaceAllUsesWith(ByteCmpRes);

				assert(PHBranch->isUnconditional() &&
				"Expected preheader to terminate with an unconditional branch.");

				// If no mismatch was found, we can jump to the end block. Create a
				// new basic block for the compare instruction.
				auto *CmpBB = BasicBlock::Create(Preheader->getContext(), "byte.compare",
				Preheader->getParent());
				CmpBB->moveBefore(EndBB);

				// Replace the branch in the preheader with an always-true conditional branch.
				// This ensures there is still a reference to the original loop.
				Value *BrCnd = Builder.CreateICmpEQ(ConstantInt::get(Start->getType(), 1),
				ConstantInt::get(Start->getType(), 1));
				Builder.CreateCondBr(BrCnd, CmpBB, Header);
				PHBranch->eraseFromParent();

				// Create the branch to either the end or found block depending on the value
				// returned by the intrinsic.
				Builder.SetInsertPoint(CmpBB);
				Value *FoundCmp = Builder.CreateICmpEQ(ByteCmpRes, MaxLen);
				Builder.CreateCondBr(FoundCmp, EndBB, FoundBB);

				auto fixSuccessorPhis = [&](BasicBlock *SuccBB) {
				  for (PHINode &PN : SuccBB->phis()) {
				    // At this point we've already replaced all uses of the result from the
				    // loop with ByteCmpRes. Look through the incoming values to find
				    // ByteCmpRes, meaning this is a Phi collecting the results of the byte
				    // compare.
				    bool ResPhi = false;
				    for (Value *Op : PN.incoming_values())
				      if (Op == ByteCmpRes)
				        ResPhi = true;

				    // If any of the incoming values were ByteCmpRes, we need to also add
				    // it as an incoming value from CmpBB.
				    if (ResPhi)
				      PN.addIncoming(ByteCmpRes, CmpBB);
				    else {
				      // Otherwise, this is a Phi for different values. We should create
				      // a new incoming value from CmpBB matching the same value as from
				      // the old loop.
				      for (BasicBlock *BB : PN.blocks())
				        if (CurLoop->contains(BB)) {
				          PN.addIncoming(PN.getIncomingValueForBlock(BB), CmpBB);
				          break;
				        }
				    }
				  }
				};

				// Ensure all Phis in the successors of CmpBB have an incoming value from it.
				fixSuccessorPhis(EndBB);
				fixSuccessorPhis(FoundBB);

				// The new CmpBB block isn't part of the loop, but will need to be added to
				// the outer loop if there is one.
				if (!CurLoop->isOutermost())
				CurLoop->getParentLoop()->addBasicBlockToLoop(CmpBB, *LI);

				// Update the dominator tree with the new block.
				DT->addNewBlock(CmpBB, Preheader);
				}
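
As a reading aid, the intended semantics of expandFindMismatch and transformByteCompare can be modelled in plain C++. This is only a sketch under stated assumptions, not code from the patch: the names mismatchFrom, preIncLoop and transformedModel are hypothetical, MaxLen is assumed to be an exclusive bound as in the loop from the patch summary, and the start index is assumed not to exceed MaxLen.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of the value expandFindMismatch produces: the
// index of the first byte in [start, maxLen) at which a and b differ, or
// maxLen if there is no such byte. Assumes start <= maxLen.
std::uint32_t mismatchFrom(const std::uint8_t *a, const std::uint8_t *b,
                           std::uint32_t start, std::uint32_t maxLen) {
  assert(start <= maxLen && "model assumes a non-wrapping index range");
  std::uint32_t i = start;
  while (i != maxLen && a[i] == b[i])
    ++i;
  return i;
}

// Hypothetical source loop in which the index is incremented before the
// loads, i.e. the IncIdx == true case.
std::uint32_t preIncLoop(const std::uint8_t *a, const std::uint8_t *b,
                         std::uint32_t idx, std::uint32_t n) {
  while (++idx != n)
    if (a[idx] != b[idx])
      break;
  return idx;
}

enum class ExitBlock { End, Found };

struct ByteCompareResult {
  std::uint32_t index; // value that replaces the loop's index
  ExitBlock exit;      // successor chosen by the byte.compare block
};

// Model of the rewritten control flow: bump the start index (IncIdx), run the
// mismatch search, then branch to EndBB if the result equals n and to FoundBB
// otherwise.
ByteCompareResult transformedModel(const std::uint8_t *a, const std::uint8_t *b,
                                   std::uint32_t idx, std::uint32_t n) {
  std::uint32_t res = mismatchFrom(a, b, idx + 1, n);
  return {res, res == n ? ExitBlock::End : ExitBlock::Found};
}
```

For inputs that satisfy the assumptions, preIncLoop(a, b, idx, n) and transformedModel(a, b, idx, n).index should agree, which is the property the pass relies on when it replaces all uses of the index with the result of the mismatch search.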

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td

This file is larger than 256 KB, so syntax highlighting is disabled by default.

[1,958 lines not shown]	let Predicates = [HasSVEorSME] in {
  def ADDVL_XXI : sve_int_arith_vl<0b0, "addvl">;
  def ADDPL_XXI : sve_int_arith_vl<0b1, "addpl">;

  defm CNTB_XPiI : sve_int_count<0b000, "cntb", int_aarch64_sve_cntb>;
  defm CNTH_XPiI : sve_int_count<0b010, "cnth", int_aarch64_sve_cnth>;
  defm CNTW_XPiI : sve_int_count<0b100, "cntw", int_aarch64_sve_cntw>;
  defm CNTD_XPiI : sve_int_count<0b110, "cntd", int_aarch64_sve_cntd>;
  defm CNTP_XPP : sve_int_pcount_pred<0b0000, "cntp", int_aarch64_sve_cntp>;

+  def : Pat<(i64 (int_experimental_cttz_elts nxv16i1:$Op1, (i32 timm32_0_1))),
+            (i64 (!cast<Instruction>(CNTP_XPP_B)
+              (nxv16i1 (!cast<Instruction>(BRKB_PPzP) (PTRUE_B 31), nxv16i1:$Op1)),
+              (nxv16i1 (!cast<Instruction>(BRKB_PPzP) (PTRUE_B 31), nxv16i1:$Op1))))>;
}

  defm INCB_XPiI : sve_int_pred_pattern_a<0b000, "incb", add, int_aarch64_sve_cntb>;
  defm DECB_XPiI : sve_int_pred_pattern_a<0b001, "decb", sub, int_aarch64_sve_cntb>;
  defm INCH_XPiI : sve_int_pred_pattern_a<0b010, "inch", add, int_aarch64_sve_cnth>;
  defm DECH_XPiI : sve_int_pred_pattern_a<0b011, "dech", sub, int_aarch64_sve_cnth>;
  defm INCW_XPiI : sve_int_pred_pattern_a<0b100, "incw", add, int_aarch64_sve_cntw>;
  defm DECW_XPiI : sve_int_pred_pattern_a<0b101, "decw", sub, int_aarch64_sve_cntw>;
[69 lines not shown]	let Predicates = [HasSVEorSME] in {

  defm SQINCP_ZP : sve_int_count_v<0b00000, "sqincp", int_aarch64_sve_sqincp>;
  defm UQINCP_ZP : sve_int_count_v<0b00100, "uqincp", int_aarch64_sve_uqincp>;
  defm SQDECP_ZP : sve_int_count_v<0b01000, "sqdecp", int_aarch64_sve_sqdecp>;
  defm UQDECP_ZP : sve_int_count_v<0b01100, "uqdecp", int_aarch64_sve_uqdecp>;
  defm INCP_ZP : sve_int_count_v<0b10000, "incp">;
  defm DECP_ZP : sve_int_count_v<0b10100, "decp">;

+  def : Pat<(i64 (add GPR64:$Op1, (i64 (int_experimental_cttz_elts nxv16i1:$Op2, (i32 timm32_0_1))))),
+            (i64 (!cast<Instruction>(INCP_XP_B)
+              (nxv16i1 (!cast<Instruction>(BRKB_PPzP) (PTRUE_B 31), nxv16i1:$Op2)),
+              GPR64:$Op1))>;

+  def : Pat<(i32 (add GPR32:$Op1, (trunc (i64 (int_experimental_cttz_elts nxv16i1:$Op2, (i32 timm32_0_1)))))),
+            (i32 (EXTRACT_SUBREG (i64 (!cast<Instruction>(INCP_XP_B)
+              (nxv16i1 (!cast<Instruction>(BRKB_PPzP) (PTRUE_B 31), nxv16i1:$Op2)),
+              (INSERT_SUBREG (i64 (IMPLICIT_DEF)), GPR32:$Op1, sub_32))),
+              sub_32))>;

  defm INDEX_RR : sve_int_index_rr<"index", AArch64mul_p_oneuse>;
  defm INDEX_IR : sve_int_index_ir<"index", AArch64mul_p, AArch64mul_p_oneuse>;
  defm INDEX_RI : sve_int_index_ri<"index">;
  defm INDEX_II : sve_int_index_ii<"index">;

  // Unpredicated shifts
  defm ASR_ZZI : sve_int_bin_cons_shift_imm_right<0b00, "asr", AArch64asr_p>;
  defm LSR_ZZI : sve_int_bin_cons_shift_imm_right<0b01, "lsr", AArch64lsr_p>;
[1,943 lines not shown]
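
The added patterns select llvm.experimental.cttz.elts on nxv16i1 to BRKB followed by CNTP, or to BRKB followed by INCP when the count feeds an add. A minimal scalar model of why this yields the trailing-zero element count is sketched below. It assumes an all-true governing predicate, as produced by PTRUE_B 31 in the patterns; the helper names are hypothetical and the predicate is modelled as a std::vector<bool>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of BRKB with an all-true governing predicate: the result is true for
// every lane strictly before the first true lane of pn, and false from that
// lane onwards. If pn has no true lane, every result lane is true.
std::vector<bool> brkbAllActive(const std::vector<bool> &pn) {
  std::vector<bool> pd(pn.size(), false);
  for (std::size_t i = 0; i < pn.size() && !pn[i]; ++i)
    pd[i] = true;
  return pd;
}

// Model of CNTP: count the lanes that are true in both pg and pn.
std::size_t cntp(const std::vector<bool> &pg, const std::vector<bool> &pn) {
  assert(pg.size() == pn.size() && "predicates must have the same lane count");
  std::size_t count = 0;
  for (std::size_t i = 0; i < pn.size(); ++i)
    if (pg[i] && pn[i])
      ++count;
  return count;
}

// Model of INCP: add the number of true lanes in pm to a scalar.
std::uint64_t incp(std::uint64_t x, const std::vector<bool> &pm) {
  return x + cntp(pm, pm);
}

// cttz.elts as selected by the first pattern: CNTP(BRKB(mask), BRKB(mask)).
std::size_t cttzEltsViaBrkb(const std::vector<bool> &mask) {
  std::vector<bool> before = brkbAllActive(mask);
  return cntp(before, before);
}
```

In this model an all-false mask yields the lane count, which is consistent with the same instruction sequence appearing in the ctz_nxv16i1_0 and ctz_nxv16i1_1 tests for both values of the second intrinsic operand.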

llvm/lib/Target/AArch64/AArch64TargetMachine.h

//==-- AArch64TargetMachine.h - Define TargetMachine for AArch64 -*- C++ -*-==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file declares the AArch64 specific subclass of TargetMachine.
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_LIB_TARGET_AARCH64_AARCH64TARGETMACHINE_H
#define LLVM_LIB_TARGET_AARCH64_AARCH64TARGETMACHINE_H

#include "AArch64InstrInfo.h"
+#include "AArch64LoopIdiomRecognize.h"
#include "AArch64Subtarget.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Target/TargetMachine.h"
#include <optional>

namespace llvm {

class AArch64TargetMachine : public LLVMTargetMachine {
[13 lines not shown]	public:
  // DO NOT IMPLEMENT: There is no such thing as a valid default subtarget,
  // subtargets are per-function entities based on the target-specific
  // attributes of each function.
  const AArch64Subtarget *getSubtargetImpl() const = delete;

  // Pass Pipeline Configuration
  TargetPassConfig *createPassConfig(PassManagerBase &PM) override;

+  void registerPassBuilderCallbacks(PassBuilder &PB) override;

  TargetTransformInfo getTargetTransformInfo(const Function &F) const override;

  TargetLoweringObjectFile* getObjFileLowering() const override {
    return TLOF.get();
  }

  MachineFunctionInfo *
  createMachineFunctionInfo(BumpPtrAllocator &Allocator, const Function &F,
[49 lines not shown]

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp

//===-- AArch64TargetMachine.cpp - Define TargetMachine for AArch64 -------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
//
//===----------------------------------------------------------------------===//

#include "AArch64TargetMachine.h"
#include "AArch64.h"
+#include "AArch64LoopIdiomRecognize.h"
#include "AArch64MachineFunctionInfo.h"
#include "AArch64MachineScheduler.h"
#include "AArch64MacroFusion.h"
#include "AArch64Subtarget.h"
#include "AArch64TargetObjectFile.h"
#include "AArch64TargetTransformInfo.h"
#include "MCTargetDesc/AArch64MCTargetDesc.h"
#include "TargetInfo/AArch64TargetInfo.h"
[16 lines not shown]
#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"
#include "llvm/InitializePasses.h"
#include "llvm/MC/MCAsmInfo.h"
#include "llvm/MC/MCTargetOptions.h"
#include "llvm/MC/TargetRegistry.h"
#include "llvm/Pass.h"
+#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Target/TargetLoweringObjectFile.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/TargetParser/Triple.h"
#include "llvm/Transforms/CFGuard.h"
#include "llvm/Transforms/Scalar.h"
#include <memory>
[158 lines not shown]	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAArch64Target() {
  initializeAArch64BranchTargetsPass(*PR);
  initializeAArch64CollectLOHPass(*PR);
  initializeAArch64CompressJumpTablesPass(*PR);
  initializeAArch64ConditionalComparesPass(*PR);
  initializeAArch64ConditionOptimizerPass(*PR);
  initializeAArch64DeadRegisterDefinitionsPass(*PR);
  initializeAArch64ExpandPseudoPass(*PR);
  initializeAArch64LoadStoreOptPass(*PR);
+  initializeAArch64LoopIdiomRecognizeLegacyPassPass(*PR);
  initializeAArch64MIPeepholeOptPass(*PR);
  initializeAArch64SIMDInstrOptPass(*PR);
  initializeAArch64O0PreLegalizerCombinerPass(*PR);
  initializeAArch64PreLegalizerCombinerPass(*PR);
  initializeAArch64PostLegalizerCombinerPass(*PR);
  initializeAArch64PostLegalizerLoweringPass(*PR);
  initializeAArch64PostSelectOptimizePass(*PR);
  initializeAArch64PromoteConstantPass(*PR);
[295 lines not shown]	public:
  void addPostBBSections() override;
  void addPreEmitPass2() override;

  std::unique_ptr<CSEConfigBase> getCSEConfig() const override;
};

} // end anonymous namespace

+void AArch64TargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
+  PB.registerLateLoopOptimizationsEPCallback(
+      [=](LoopPassManager &LPM, OptimizationLevel Level) {
+        LPM.addPass(AArch64LoopIdiomRecognizePass());
+      });
+}

TargetTransformInfo
AArch64TargetMachine::getTargetTransformInfo(const Function &F) const {
  return TargetTransformInfo(AArch64TTIImpl(this, F));
}

TargetPassConfig *AArch64TargetMachine::createPassConfig(PassManagerBase &PM) {
  return new AArch64PassConfig(*this, PM);
}
[328 lines not shown]

llvm/lib/Target/AArch64/CMakeLists.txt

[58 lines not shown]	add_llvm_target(AArch64CodeGen
  AArch64GlobalsTagging.cpp
  AArch64CompressJumpTables.cpp
  AArch64ConditionOptimizer.cpp
  AArch64RedundantCopyElimination.cpp
  AArch64ISelDAGToDAG.cpp
  AArch64ISelLowering.cpp
  AArch64InstrInfo.cpp
  AArch64LoadStoreOptimizer.cpp
+  AArch64LoopIdiomRecognize.cpp
  AArch64LowerHomogeneousPrologEpilog.cpp
  AArch64MachineFunctionInfo.cpp
  AArch64MachineScheduler.cpp
  AArch64MacroFusion.cpp
  AArch64MIPeepholeOpt.cpp
  AArch64MCInstLower.cpp
  AArch64PromoteConstant.cpp
  AArch64PBQPRegAlloc.cpp
[46 lines not shown]

llvm/test/CodeGen/AArch64/intrinsic-cttz-elts.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
				; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

				; FIXED WIDTH

				define i8 @ctz_v8i1(<8 x i1> %a) {
				; CHECK-LABEL: .LCPI0_0:
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 7
				; CHECK-NEXT: .byte 6
				; CHECK-NEXT: .byte 5
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 3
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 1
				; CHECK-LABEL: ctz_v8i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: shl v0.8b, v0.8b, #7
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: mov w9, #8 // =0x8
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
				; CHECK-NEXT: cmlt v0.8b, v0.8b, #0
				; CHECK-NEXT: and v0.8b, v0.8b, v1.8b
				; CHECK-NEXT: umaxv b0, v0.8b
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w0, w9, w8
				; CHECK-NEXT: ret
				%res = call i8 @llvm.experimental.cttz.elts.i8.v8i1(<8 x i1> %a, i32 0)
				ret i8 %res
				}

				define i32 @ctz_v16i1(<16 x i1> %a) {
				; CHECK-LABEL: .LCPI1_0:
				; CHECK-NEXT: .byte 16
				; CHECK-NEXT: .byte 15
				; CHECK-NEXT: .byte 14
				; CHECK-NEXT: .byte 13
				; CHECK-NEXT: .byte 12
				; CHECK-NEXT: .byte 11
				; CHECK-NEXT: .byte 10
				; CHECK-NEXT: .byte 9
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 7
				; CHECK-NEXT: .byte 6
				; CHECK-NEXT: .byte 5
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 3
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 1
				; CHECK-LABEL: ctz_v16i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: shl v0.16b, v0.16b, #7
				; CHECK-NEXT: adrp x8, .LCPI1_0
				; CHECK-NEXT: mov w9, #16 // =0x10
				; CHECK-NEXT: ldr q1, [x8, :lo12:.LCPI1_0]
				; CHECK-NEXT: cmlt v0.16b, v0.16b, #0
				; CHECK-NEXT: and v0.16b, v0.16b, v1.16b
				; CHECK-NEXT: umaxv b0, v0.16b
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and w0, w8, #0xff
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.v16i1(<16 x i1> %a, i32 0)
				ret i32 %res
				}

				define i16 @ctz_v4i32(<4 x i32> %a) {
				; CHECK-LABEL: .LCPI2_0:
				; CHECK-NEXT: .hword 4
				; CHECK-NEXT: .hword 3
				; CHECK-NEXT: .hword 2
				; CHECK-NEXT: .hword 1
				; CHECK-LABEL: ctz_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cmtst v0.4s, v0.4s, v0.4s
				; CHECK-NEXT: adrp x8, .LCPI2_0
				; CHECK-NEXT: mov w9, #4 // =0x4
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI2_0]
				; CHECK-NEXT: xtn v0.4h, v0.4s
				; CHECK-NEXT: and v0.8b, v0.8b, v1.8b
				; CHECK-NEXT: umaxv h0, v0.4h
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and w0, w8, #0xff
				; CHECK-NEXT: ret
				%res = call i16 @llvm.experimental.cttz.elts.i16.v4i32(<4 x i32> %a, i32 0)
				ret i16 %res
				}

				; ZERO IS POISON

				define i8 @ctz_v8i1_poison(<8 x i1> %a) {
				; CHECK-LABEL: .LCPI3_0:
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 7
				; CHECK-NEXT: .byte 6
				; CHECK-NEXT: .byte 5
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 3
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 1
				; CHECK-LABEL: ctz_v8i1_poison:
				; CHECK: // %bb.0:
				; CHECK-NEXT: shl v0.8b, v0.8b, #7
				; CHECK-NEXT: adrp x8, .LCPI3_0
				; CHECK-NEXT: mov w9, #8 // =0x8
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI3_0]
				; CHECK-NEXT: cmlt v0.8b, v0.8b, #0
				; CHECK-NEXT: and v0.8b, v0.8b, v1.8b
				; CHECK-NEXT: umaxv b0, v0.8b
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and w9, w8, #0xff
				; CHECK-NEXT: cmp w9, #8
				; CHECK-NEXT: csel w0, w8, w8, eq
				; CHECK-NEXT: ret
				%res = call i8 @llvm.experimental.cttz.elts.i8.v8i1(<8 x i1> %a, i32 1)
				ret i8 %res
				}

				; SCALABLE, WITH VSCALE RANGE

				define i64 @ctz_nxv8i1(<vscale x 8 x i1> %a) #0 {
				; CHECK-LABEL: ctz_nxv8i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: index z0.h, #0, #-1
				; CHECK-NEXT: mov z1.h, p0/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: cnth x9
				; CHECK-NEXT: inch z0.h
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: and z0.h, z0.h, #0xff
				; CHECK-NEXT: umaxv h0, p0, z0.h
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and x0, x8, #0xff
				; CHECK-NEXT: ret
				%res = call i64 @llvm.experimental.cttz.elts.i64.nxv8i1(<vscale x 8 x i1> %a, i32 0)
				ret i64 %res
				}

				define i32 @ctz_nxv32i1(<vscale x 32 x i1> %a) #0 {
				; CHECK-LABEL: ctz_nxv32i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: index z0.h, #0, #-1
				; CHECK-NEXT: cnth x8
				; CHECK-NEXT: punpklo p2.h, p0.b
				; CHECK-NEXT: neg x8, x8
				; CHECK-NEXT: punpklo p3.h, p1.b
				; CHECK-NEXT: rdvl x9, #2
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: mov z1.h, w8
				; CHECK-NEXT: rdvl x8, #-1
				; CHECK-NEXT: punpkhi p1.h, p1.b
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: inch z0.h, all, mul #4
				; CHECK-NEXT: mov z3.h, p2/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: ptrue p2.h
				; CHECK-NEXT: mov z5.h, p3/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: add z1.h, z0.h, z1.h
				; CHECK-NEXT: add z4.h, z0.h, z2.h
				; CHECK-NEXT: mov z6.h, p0/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: mov z7.h, p1/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: and z0.d, z0.d, z3.d
				; CHECK-NEXT: add z2.h, z1.h, z2.h
				; CHECK-NEXT: and z3.d, z4.d, z5.d
				; CHECK-NEXT: and z1.d, z1.d, z6.d
				; CHECK-NEXT: and z2.d, z2.d, z7.d
				; CHECK-NEXT: umax z0.h, p2/m, z0.h, z3.h
				; CHECK-NEXT: umax z1.h, p2/m, z1.h, z2.h
				; CHECK-NEXT: umax z0.h, p2/m, z0.h, z1.h
				; CHECK-NEXT: umaxv h0, p2, z0.h
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and w0, w8, #0xffff
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv32i1(<vscale x 32 x i1> %a, i32 0)
				ret i32 %res
				}

				define i32 @ctz_nxv4i32(<vscale x 4 x i32> %a) #0 {
				; CHECK-LABEL: ctz_nxv4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: index z1.s, #0, #-1
				; CHECK-NEXT: cntw x9
				; CHECK-NEXT: incw z1.s
				; CHECK-NEXT: cmpne p1.s, p0/z, z0.s, #0
				; CHECK-NEXT: mov z0.s, p1/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: and z0.d, z1.d, z0.d
				; CHECK-NEXT: and z0.s, z0.s, #0xff
				; CHECK-NEXT: umaxv s0, p0, z0.s
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w8, w9, w8
				; CHECK-NEXT: and w0, w8, #0xff
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv4i32(<vscale x 4 x i32> %a, i32 0)
				ret i32 %res
				}

				; SCALABLE, NO VSCALE RANGE

				define i32 @ctz_nxv8i1_no_range(<vscale x 8 x i1> %a) {
				; CHECK-LABEL: ctz_nxv8i1_no_range:
				; CHECK: // %bb.0:
				; CHECK-NEXT: index z0.s, #0, #-1
				; CHECK-NEXT: punpklo p1.h, p0.b
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: neg x8, x8
				; CHECK-NEXT: cnth x9
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: incw z0.s, all, mul #2
				; CHECK-NEXT: mov z2.s, p1/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: mov z3.s, p0/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: add z1.s, z0.s, z1.s
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: and z1.d, z1.d, z3.d
				; CHECK-NEXT: umax z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: umaxv s0, p0, z0.s
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: sub w0, w9, w8
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv8i1(<vscale x 8 x i1> %a, i32 0)
				ret i32 %res
				}

				; MATCH WITH BRKB + CNTP

				define i32 @ctz_nxv16i1_0(<vscale x 16 x i1> %pg, <vscale x 16 x i1> %a) {
				; CHECK-LABEL: ctz_nxv16i1_0:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: brkb p0.b, p0/z, p1.b
				; CHECK-NEXT: cntp x0, p0, p0.b
				; CHECK-NEXT: // kill: def $w0 killed $w0 killed $x0
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> %a, i32 0)
				ret i32 %res
				}

				define i32 @ctz_nxv16i1_1(<vscale x 16 x i1> %pg, <vscale x 16 x i1> %a) {
				; CHECK-LABEL: ctz_nxv16i1_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: brkb p0.b, p0/z, p1.b
				; CHECK-NEXT: cntp x0, p0, p0.b
				; CHECK-NEXT: // kill: def $w0 killed $w0 killed $x0
				; CHECK-NEXT: ret
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> %a, i32 1)
				ret i32 %res
				}

				define i32 @ctz_and_nxv16i1(<vscale x 16 x i1> %pg, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
				; CHECK-LABEL: ctz_and_nxv16i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p1.b
				; CHECK-NEXT: cmpne p0.b, p0/z, z0.b, z1.b
				; CHECK-NEXT: brkb p0.b, p1/z, p0.b
				; CHECK-NEXT: cntp x0, p0, p0.b
				; CHECK-NEXT: // kill: def $w0 killed $w0 killed $x0
				; CHECK-NEXT: ret
				%cmp = icmp ne <vscale x 16 x i8> %a, %b
				%select = select <vscale x 16 x i1> %pg, <vscale x 16 x i1> %cmp, <vscale x 16 x i1> zeroinitializer
				%and = and <vscale x 16 x i1> %pg, %select
				%res = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> %and, i32 0)
				ret i32 %res
				}

				define i64 @add_i64_ctz_nxv16i1_1(<vscale x 16 x i1> %pg, <vscale x 16 x i1> %a, i64 %b) {
				; CHECK-LABEL: add_i64_ctz_nxv16i1_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: brkb p0.b, p0/z, p1.b
				; CHECK-NEXT: incp x0, p0.b
				; CHECK-NEXT: ret
				%res = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> %a, i32 1)
				%add = add i64 %res, %b
				ret i64 %add
				}

				define i32 @add_i32_ctz_nxv16i1_1(<vscale x 16 x i1> %pg, <vscale x 16 x i1> %a, i32 %b) {
				; CHECK-LABEL: add_i32_ctz_nxv16i1_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: // kill: def $w0 killed $w0 def $x0
				; CHECK-NEXT: brkb p0.b, p0/z, p1.b
				; CHECK-NEXT: incp x0, p0.b
				; CHECK-NEXT: // kill: def $w0 killed $w0 killed $x0
				; CHECK-NEXT: ret
				%res = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> %a, i32 1)
				%trunc = trunc i64 %res to i32
				%add = add i32 %trunc, %b
				ret i32 %add
				}

				declare i8 @llvm.experimental.cttz.elts.i8.v8i1(<8 x i1>, i32)
				declare i32 @llvm.experimental.cttz.elts.i32.v16i1(<16 x i1>, i32)
				declare i16 @llvm.experimental.cttz.elts.i16.v4i32(<4 x i32>, i32)

				declare i32 @llvm.experimental.cttz.elts.i32.nxv8i1(<vscale x 8 x i1>, i32)
				declare i64 @llvm.experimental.cttz.elts.i64.nxv8i1(<vscale x 8 x i1>, i32)
				declare i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1>, i32)
				declare i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1>, i32)
				declare i32 @llvm.experimental.cttz.elts.i32.nxv32i1(<vscale x 32 x i1>, i32)
				declare i32 @llvm.experimental.cttz.elts.i32.nxv4i32(<vscale x 4 x i32>, i32)

				attributes #0 = { vscale_range(1,16) }
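
The fixed-width CHECK lines above suggest the shape of the generic lowering: a constant vector N, N-1, ..., 1 is ANDed with the sign-extended mask, reduced with an unsigned maximum, and the result is subtracted from N. The sketch below models that arithmetic in scalar C++. It is inferred from the expected assembly in this test rather than from the SelectionDAGBuilder change itself, and cttzEltsGeneric is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the lowering suggested by the CHECK lines: lane i carries
// the weight n - i when it is set and 0 otherwise, so the largest weight
// belongs to the first set lane. Subtracting it from n gives that lane's
// index, or n when no lane is set.
std::size_t cttzEltsGeneric(const std::vector<bool> &mask) {
  assert(!mask.empty() && "model assumes at least one lane");
  const std::size_t n = mask.size();
  std::size_t maxWeight = 0;
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t weight = mask[i] ? n - i : 0; // step vector AND mask
    maxWeight = std::max(maxWeight, weight);        // umaxv reduction
  }
  return n - maxWeight;                             // final sub
}
```

For an 8-lane mask whose first set lane is lane 2, the weights are 6 for that lane and at most 5 for any later set lane, so the model returns 8 - 6 = 2.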

llvm/test/Other/new-pm-defaults.ll

[173 lines not shown]
; CHECK-O-NEXT: Running pass: LICM
; CHECK-O-NEXT: Running pass: SimpleLoopUnswitchPass
; CHECK-O-NEXT: Running analysis: OuterAnalysisManagerProxy
; CHECK-O-NEXT: Running pass: SimplifyCFGPass
; CHECK-O-NEXT: Running pass: InstCombinePass
; CHECK-O-NEXT: Running pass: LoopSimplifyPass
; CHECK-O-NEXT: Running pass: LCSSAPass
; CHECK-O-NEXT: Running pass: LoopIdiomRecognizePass
-; CHECK-O-NEXT: Running pass: IndVarSimplifyPass
; CHECK-EP-LOOP-LATE-NEXT: Running pass: NoOpLoopPass
+; CHECK-O-NEXT: Running pass: IndVarSimplifyPass
; CHECK-O-NEXT: Running pass: LoopDeletionPass
; CHECK-O-NEXT: Running pass: LoopFullUnrollPass
; CHECK-EP-LOOP-END-NEXT: Running pass: NoOpLoopPass
; CHECK-O-NEXT: Running pass: SROAPass on foo
; CHECK-O23SZ-NEXT: Running pass: VectorCombinePass
; CHECK-O23SZ-NEXT: Running pass: MergedLoadStoreMotionPass
; CHECK-O23SZ-NEXT: Running pass: GVNPass
; CHECK-O23SZ-NEXT: Running analysis: MemoryDependenceAnalysis
[121 lines not shown]

llvm/test/Transforms/LoopIdiom/AArch64/byte-compare-index.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
				; RUN: opt -aarch64-lir -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S < %s | FileCheck %s
				; RUN: opt -aarch64-lir -simplifycfg -mtriple aarch64-unknown-linux-gnu -mattr=+sve -S < %s | FileCheck %s --check-prefix=LOOP-DEL

				define i32 @compare_bytes_simple(ptr %a, ptr %b, i32 %len, i32 %n) {
				; CHECK-LABEL: define i32 @compare_bytes_simple(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[LEN]], 1
				; CHECK-NEXT: br label [[MISMATCH_MIN_IT_CHECK:%.*]]
				; CHECK: mismatch_min_it_check:
				; CHECK-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = zext i32 [[N]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = icmp ule i32 [[TMP0]], [[N]]
				; CHECK-NEXT: br i1 [[TMP3]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0:![0-9]+]]
				; CHECK: mismatch_mem_check:
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP1]]
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP1]]
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
				; CHECK-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP4]] to i64
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP2]]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP2]]
				; CHECK-NEXT: [[TMP10:%.*]] = ptrtoint ptr [[TMP8]] to i64
				; CHECK-NEXT: [[TMP11:%.*]] = ptrtoint ptr [[TMP9]] to i64
				; CHECK-NEXT: [[TMP12:%.*]] = lshr i64 [[TMP7]], 12
				; CHECK-NEXT: [[TMP13:%.*]] = lshr i64 [[TMP10]], 12
				; CHECK-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP6]], 12
				; CHECK-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP11]], 12
				; CHECK-NEXT: [[TMP16:%.*]] = icmp ne i64 [[TMP12]], [[TMP13]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; CHECK-NEXT: [[TMP18:%.*]] = or i1 [[TMP16]], [[TMP17]]
				; CHECK-NEXT: br i1 [[TMP18]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1:![0-9]+]]
				; CHECK: mismatch_sve_loop_preheader:
				; CHECK-NEXT: [[TMP19:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP1]], i64 [[TMP2]])
				; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP21:%.*]] = mul nuw nsw i64 [[TMP20]], 16
				; CHECK-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; CHECK: mismatch_sve_loop:
				; CHECK-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP19]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP30:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP1]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP29:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP23:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP22]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP26:%.*]] = icmp ne <vscale x 16 x i8> [[TMP23]], [[TMP25]]
				; CHECK-NEXT: [[TMP27:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP26]], <vscale x 16 x i1> zeroinitializer
				; CHECK-NEXT: [[TMP28:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP27]])
				; CHECK-NEXT: br i1 [[TMP28]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; CHECK: mismatch_sve_loop_inc:
				; CHECK-NEXT: [[TMP29]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP21]]
				; CHECK-NEXT: [[TMP30]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP29]], i64 [[TMP2]])
				; CHECK-NEXT: [[TMP31:%.*]] = extractelement <vscale x 16 x i1> [[TMP30]], i64 0
				; CHECK-NEXT: br i1 [[TMP31]], label [[MISMATCH_SVE_LOOP]], label [[MISMATCH_END:%.*]]
				; CHECK: mismatch_sve_loop_found:
				; CHECK-NEXT: [[TMP32:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP27]]
				; CHECK-NEXT: [[TMP33:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP32]], i32 1)
				; CHECK-NEXT: [[TMP34:%.*]] = zext i32 [[TMP33]] to i64
				; CHECK-NEXT: [[TMP35:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP34]]
				; CHECK-NEXT: [[TMP36:%.*]] = trunc i64 [[TMP35]] to i32
				; CHECK-NEXT: br label [[MISMATCH_END]]
				; CHECK: mismatch_loop_pre:
				; CHECK-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP0]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP0]], [[MISMATCH_MIN_IT_CHECK]] ]
				; CHECK-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; CHECK: mismatch_loop:
				; CHECK-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP43:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[TMP37:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; CHECK-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP37]]
				; CHECK-NEXT: [[TMP39:%.*]] = load i8, ptr [[TMP38]], align 1
				; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP37]]
				; CHECK-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; CHECK-NEXT: [[TMP42:%.*]] = icmp eq i8 [[TMP39]], [[TMP41]]
				; CHECK-NEXT: br i1 [[TMP42]], label [[MISMATCH_LOOP_INC]], label [[MISMATCH_END]]
				; CHECK: mismatch_loop_inc:
				; CHECK-NEXT: [[TMP43]] = add i32 [[MISMATCH_INDEX]], 1
				; CHECK-NEXT: [[TMP44:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; CHECK-NEXT: br i1 [[TMP44]], label [[MISMATCH_END]], label [[MISMATCH_LOOP]]
				; CHECK: mismatch_end:
				; CHECK-NEXT: [[MISMATCH_RESULT:%.*]] = phi i32 [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP36]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; CHECK-NEXT: br i1 true, label [[BYTE_COMPARE:%.*]], label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[MISMATCH_END]] ], [ [[MISMATCH_RESULT]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC:%.*]] = add i32 [[MISMATCH_RESULT]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[MISMATCH_RESULT]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP45:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP46:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP45]], [[TMP46]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: byte.compare:
				; CHECK-NEXT: [[TMP47:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[TMP47]], label [[WHILE_END]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[MISMATCH_RESULT]], [[WHILE_BODY]] ], [ [[MISMATCH_RESULT]], [[WHILE_COND]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ]
				; CHECK-NEXT: ret i32 [[INC_LCSSA]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_simple(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = add i32 [[LEN]], 1
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
				; LOOP-DEL-NEXT: [[TMP2:%.*]] = zext i32 [[N]] to i64
				; LOOP-DEL-NEXT: [[TMP3:%.*]] = icmp ule i32 [[TMP0]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP3]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0:![0-9]+]]
				; LOOP-DEL: mismatch_mem_check:
				; LOOP-DEL-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP1]]
				; LOOP-DEL-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP1]]
				; LOOP-DEL-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
				; LOOP-DEL-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP4]] to i64
				; LOOP-DEL-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP2]]
				; LOOP-DEL-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP2]]
				; LOOP-DEL-NEXT: [[TMP10:%.*]] = ptrtoint ptr [[TMP8]] to i64
				; LOOP-DEL-NEXT: [[TMP11:%.*]] = ptrtoint ptr [[TMP9]] to i64
				; LOOP-DEL-NEXT: [[TMP12:%.*]] = lshr i64 [[TMP7]], 12
				; LOOP-DEL-NEXT: [[TMP13:%.*]] = lshr i64 [[TMP10]], 12
				; LOOP-DEL-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP6]], 12
				; LOOP-DEL-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP11]], 12
				; LOOP-DEL-NEXT: [[TMP16:%.*]] = icmp ne i64 [[TMP12]], [[TMP13]]
				; LOOP-DEL-NEXT: [[TMP17:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; LOOP-DEL-NEXT: [[TMP18:%.*]] = or i1 [[TMP16]], [[TMP17]]
				; LOOP-DEL-NEXT: br i1 [[TMP18]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1:![0-9]+]]
				; LOOP-DEL: mismatch_sve_loop_preheader:
				; LOOP-DEL-NEXT: [[TMP19:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP1]], i64 [[TMP2]])
				; LOOP-DEL-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
				; LOOP-DEL-NEXT: [[TMP21:%.*]] = mul nuw nsw i64 [[TMP20]], 16
				; LOOP-DEL-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; LOOP-DEL: mismatch_sve_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP19]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP30:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP1]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP29:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; LOOP-DEL-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP23:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP22]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP26:%.*]] = icmp ne <vscale x 16 x i8> [[TMP23]], [[TMP25]]
				; LOOP-DEL-NEXT: [[TMP27:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP26]], <vscale x 16 x i1> zeroinitializer
				; LOOP-DEL-NEXT: [[TMP28:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP27]])
				; LOOP-DEL-NEXT: br i1 [[TMP28]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; LOOP-DEL: mismatch_sve_loop_inc:
				; LOOP-DEL-NEXT: [[TMP29]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP21]]
				; LOOP-DEL-NEXT: [[TMP30]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP29]], i64 [[TMP2]])
				; LOOP-DEL-NEXT: [[TMP31:%.*]] = extractelement <vscale x 16 x i1> [[TMP30]], i64 0
				; LOOP-DEL-NEXT: br i1 [[TMP31]], label [[MISMATCH_SVE_LOOP]], label [[WHILE_END:%.*]]
				; LOOP-DEL: mismatch_sve_loop_found:
				; LOOP-DEL-NEXT: [[TMP32:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP27]]
				; LOOP-DEL-NEXT: [[TMP33:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP32]], i32 1)
				; LOOP-DEL-NEXT: [[TMP34:%.*]] = zext i32 [[TMP33]] to i64
				; LOOP-DEL-NEXT: [[TMP35:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP34]]
				; LOOP-DEL-NEXT: [[TMP36:%.*]] = trunc i64 [[TMP35]] to i32
				; LOOP-DEL-NEXT: br label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_pre:
				; LOOP-DEL-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP0]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP0]], [[ENTRY:%.*]] ]
				; LOOP-DEL-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; LOOP-DEL: mismatch_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP43:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[TMP37:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; LOOP-DEL-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP37]]
				; LOOP-DEL-NEXT: [[TMP39:%.*]] = load i8, ptr [[TMP38]], align 1
				; LOOP-DEL-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP37]]
				; LOOP-DEL-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; LOOP-DEL-NEXT: [[TMP42:%.*]] = icmp eq i8 [[TMP39]], [[TMP41]]
				; LOOP-DEL-NEXT: br i1 [[TMP42]], label [[MISMATCH_LOOP_INC]], label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_inc:
				; LOOP-DEL-NEXT: [[TMP43]] = add i32 [[MISMATCH_INDEX]], 1
				; LOOP-DEL-NEXT: [[TMP44:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP44]], label [[WHILE_END]], label [[MISMATCH_LOOP]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[MISMATCH_RESULT:%.*]] = phi i32 [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP36]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; LOOP-DEL-NEXT: ret i32 [[MISMATCH_RESULT]]
				;
				entry:
				br label %while.cond

				while.cond:
				%len.addr = phi i32 [ %len, %entry ], [ %inc, %while.body ]
				%inc = add i32 %len.addr, 1
				%cmp.not = icmp eq i32 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%inc.lcssa = phi i32 [ %inc, %while.body ], [ %inc, %while.cond ]
				ret i32 %inc.lcssa
				}

				define i32 @compare_bytes_umin(ptr %a, ptr %b, i32 %len, i32 %n, i32 %idx1, i32 %idx2) {
				; CHECK-LABEL: define i32 @compare_bytes_umin(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]], i32 [[IDX1:%.*]], i32 [[IDX2:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[PH:%.*]]
				; CHECK: ph:
				; CHECK-NEXT: [[START:%.*]] = call i32 @llvm.umin.i32(i32 [[IDX1]], i32 [[IDX2]])
				; CHECK-NEXT: [[EXT:%.*]] = zext i32 [[START]] to i64
				; CHECK-NEXT: [[A0:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[EXT]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[A0]], align 1
				; CHECK-NEXT: [[A1:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[EXT]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[A1]], align 1
				; CHECK-NEXT: [[CMP:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; CHECK-NEXT: br i1 [[CMP]], label [[WHILE_COND_PREHEADER:%.*]], label [[WHILE_END:%.*]]
				; CHECK: while.cond.preheader:
				; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[START]], 1
				; CHECK-NEXT: br label [[MISMATCH_MIN_IT_CHECK:%.*]]
				; CHECK: mismatch_min_it_check:
				; CHECK-NEXT: [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = zext i32 [[N]] to i64
				; CHECK-NEXT: [[TMP5:%.*]] = icmp ule i32 [[TMP2]], [[N]]
				; CHECK-NEXT: br i1 [[TMP5]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0]]
				; CHECK: mismatch_mem_check:
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP3]]
				; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP3]]
				; CHECK-NEXT: [[TMP8:%.*]] = ptrtoint ptr [[TMP7]] to i64
				; CHECK-NEXT: [[TMP9:%.*]] = ptrtoint ptr [[TMP6]] to i64
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP4]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP4]]
				; CHECK-NEXT: [[TMP12:%.*]] = ptrtoint ptr [[TMP10]] to i64
				; CHECK-NEXT: [[TMP13:%.*]] = ptrtoint ptr [[TMP11]] to i64
				; CHECK-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP9]], 12
				; CHECK-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP12]], 12
				; CHECK-NEXT: [[TMP16:%.*]] = lshr i64 [[TMP8]], 12
				; CHECK-NEXT: [[TMP17:%.*]] = lshr i64 [[TMP13]], 12
				; CHECK-NEXT: [[TMP18:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp ne i64 [[TMP16]], [[TMP17]]
				; CHECK-NEXT: [[TMP20:%.*]] = or i1 [[TMP18]], [[TMP19]]
				; CHECK-NEXT: br i1 [[TMP20]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1]]
				; CHECK: mismatch_sve_loop_preheader:
				; CHECK-NEXT: [[TMP21:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP3]], i64 [[TMP4]])
				; CHECK-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP23:%.*]] = mul nuw nsw i64 [[TMP22]], 16
				; CHECK-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; CHECK: mismatch_sve_loop:
				; CHECK-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP21]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP32:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP3]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP31:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP27:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP26]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP28:%.*]] = icmp ne <vscale x 16 x i8> [[TMP25]], [[TMP27]]
				; CHECK-NEXT: [[TMP29:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP28]], <vscale x 16 x i1> zeroinitializer
				; CHECK-NEXT: [[TMP30:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP29]])
				; CHECK-NEXT: br i1 [[TMP30]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; CHECK: mismatch_sve_loop_inc:
				; CHECK-NEXT: [[TMP31]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP23]]
				; CHECK-NEXT: [[TMP32]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP31]], i64 [[TMP4]])
				; CHECK-NEXT: [[TMP33:%.*]] = extractelement <vscale x 16 x i1> [[TMP32]], i64 0
				; CHECK-NEXT: br i1 [[TMP33]], label [[MISMATCH_SVE_LOOP]], label [[MISMATCH_END:%.*]]
				; CHECK: mismatch_sve_loop_found:
				; CHECK-NEXT: [[TMP34:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP29]]
				; CHECK-NEXT: [[TMP35:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP34]], i32 1)
				; CHECK-NEXT: [[TMP36:%.*]] = zext i32 [[TMP35]] to i64
				; CHECK-NEXT: [[TMP37:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP36]]
				; CHECK-NEXT: [[TMP38:%.*]] = trunc i64 [[TMP37]] to i32
				; CHECK-NEXT: br label [[MISMATCH_END]]
				; CHECK: mismatch_loop_pre:
				; CHECK-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP2]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP2]], [[MISMATCH_MIN_IT_CHECK]] ]
				; CHECK-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; CHECK: mismatch_loop:
				; CHECK-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP45:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[TMP39:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP39]]
				; CHECK-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; CHECK-NEXT: [[TMP42:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP39]]
				; CHECK-NEXT: [[TMP43:%.*]] = load i8, ptr [[TMP42]], align 1
				; CHECK-NEXT: [[TMP44:%.*]] = icmp eq i8 [[TMP41]], [[TMP43]]
				; CHECK-NEXT: br i1 [[TMP44]], label [[MISMATCH_LOOP_INC]], label [[MISMATCH_END]]
				; CHECK: mismatch_loop_inc:
				; CHECK-NEXT: [[TMP45]] = add i32 [[MISMATCH_INDEX]], 1
				; CHECK-NEXT: [[TMP46:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; CHECK-NEXT: br i1 [[TMP46]], label [[MISMATCH_END]], label [[MISMATCH_LOOP]]
				; CHECK: mismatch_end:
				; CHECK-NEXT: [[MISMATCH_RESULT:%.*]] = phi i32 [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP38]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; CHECK-NEXT: br i1 true, label [[BYTE_COMPARE:%.*]], label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_PHI:%.*]] = phi i32 [ [[START]], [[MISMATCH_END]] ], [ [[MISMATCH_RESULT]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC:%.*]] = add i32 [[MISMATCH_RESULT]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[MISMATCH_RESULT]] to i64
				; CHECK-NEXT: [[IDX_A:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP47:%.*]] = load i8, ptr [[IDX_A]], align 1
				; CHECK-NEXT: [[IDX_B:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP48:%.*]] = load i8, ptr [[IDX_B]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP47]], [[TMP48]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: byte.compare:
				; CHECK-NEXT: [[TMP49:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[TMP49]], label [[WHILE_END]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[RES:%.*]] = phi i32 [ [[N]], [[PH]] ], [ [[MISMATCH_RESULT]], [[WHILE_COND]] ], [ [[MISMATCH_RESULT]], [[WHILE_BODY]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ]
				; CHECK-NEXT: ret i32 [[RES]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_umin(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]], i32 [[IDX1:%.*]], i32 [[IDX2:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: [[START:%.*]] = call i32 @llvm.umin.i32(i32 [[IDX1]], i32 [[IDX2]])
				; LOOP-DEL-NEXT: [[EXT:%.*]] = zext i32 [[START]] to i64
				; LOOP-DEL-NEXT: [[A0:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[EXT]]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = load i8, ptr [[A0]], align 1
				; LOOP-DEL-NEXT: [[A1:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[EXT]]
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = load i8, ptr [[A1]], align 1
				; LOOP-DEL-NEXT: [[CMP:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; LOOP-DEL-NEXT: br i1 [[CMP]], label [[WHILE_COND_PREHEADER:%.*]], label [[WHILE_END:%.*]]
				; LOOP-DEL: while.cond.preheader:
				; LOOP-DEL-NEXT: [[TMP2:%.*]] = add i32 [[START]], 1
				; LOOP-DEL-NEXT: [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
				; LOOP-DEL-NEXT: [[TMP4:%.*]] = zext i32 [[N]] to i64
				; LOOP-DEL-NEXT: [[TMP5:%.*]] = icmp ule i32 [[TMP2]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP5]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0]]
				; LOOP-DEL: mismatch_mem_check:
				; LOOP-DEL-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP3]]
				; LOOP-DEL-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP3]]
				; LOOP-DEL-NEXT: [[TMP8:%.*]] = ptrtoint ptr [[TMP7]] to i64
				; LOOP-DEL-NEXT: [[TMP9:%.*]] = ptrtoint ptr [[TMP6]] to i64
				; LOOP-DEL-NEXT: [[TMP10:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP4]]
				; LOOP-DEL-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP4]]
				; LOOP-DEL-NEXT: [[TMP12:%.*]] = ptrtoint ptr [[TMP10]] to i64
				; LOOP-DEL-NEXT: [[TMP13:%.*]] = ptrtoint ptr [[TMP11]] to i64
				; LOOP-DEL-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP9]], 12
				; LOOP-DEL-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP12]], 12
				; LOOP-DEL-NEXT: [[TMP16:%.*]] = lshr i64 [[TMP8]], 12
				; LOOP-DEL-NEXT: [[TMP17:%.*]] = lshr i64 [[TMP13]], 12
				; LOOP-DEL-NEXT: [[TMP18:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; LOOP-DEL-NEXT: [[TMP19:%.*]] = icmp ne i64 [[TMP16]], [[TMP17]]
				; LOOP-DEL-NEXT: [[TMP20:%.*]] = or i1 [[TMP18]], [[TMP19]]
				; LOOP-DEL-NEXT: br i1 [[TMP20]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1]]
				; LOOP-DEL: mismatch_sve_loop_preheader:
				; LOOP-DEL-NEXT: [[TMP21:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP3]], i64 [[TMP4]])
				; LOOP-DEL-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
				; LOOP-DEL-NEXT: [[TMP23:%.*]] = mul nuw nsw i64 [[TMP22]], 16
				; LOOP-DEL-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; LOOP-DEL: mismatch_sve_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP21]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP32:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP3]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP31:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; LOOP-DEL-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP26:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP27:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP26]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP28:%.*]] = icmp ne <vscale x 16 x i8> [[TMP25]], [[TMP27]]
				; LOOP-DEL-NEXT: [[TMP29:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP28]], <vscale x 16 x i1> zeroinitializer
				; LOOP-DEL-NEXT: [[TMP30:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP29]])
				; LOOP-DEL-NEXT: br i1 [[TMP30]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; LOOP-DEL: mismatch_sve_loop_inc:
				; LOOP-DEL-NEXT: [[TMP31]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP23]]
				; LOOP-DEL-NEXT: [[TMP32]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP31]], i64 [[TMP4]])
				; LOOP-DEL-NEXT: [[TMP33:%.*]] = extractelement <vscale x 16 x i1> [[TMP32]], i64 0
				; LOOP-DEL-NEXT: br i1 [[TMP33]], label [[MISMATCH_SVE_LOOP]], label [[WHILE_END]]
				; LOOP-DEL: mismatch_sve_loop_found:
				; LOOP-DEL-NEXT: [[TMP34:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP29]]
				; LOOP-DEL-NEXT: [[TMP35:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP34]], i32 1)
				; LOOP-DEL-NEXT: [[TMP36:%.*]] = zext i32 [[TMP35]] to i64
				; LOOP-DEL-NEXT: [[TMP37:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP36]]
				; LOOP-DEL-NEXT: [[TMP38:%.*]] = trunc i64 [[TMP37]] to i32
				; LOOP-DEL-NEXT: br label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_pre:
				; LOOP-DEL-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP2]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP2]], [[WHILE_COND_PREHEADER]] ]
				; LOOP-DEL-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; LOOP-DEL: mismatch_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP45:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[TMP39:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; LOOP-DEL-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP39]]
				; LOOP-DEL-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; LOOP-DEL-NEXT: [[TMP42:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP39]]
				; LOOP-DEL-NEXT: [[TMP43:%.*]] = load i8, ptr [[TMP42]], align 1
				; LOOP-DEL-NEXT: [[TMP44:%.*]] = icmp eq i8 [[TMP41]], [[TMP43]]
				; LOOP-DEL-NEXT: br i1 [[TMP44]], label [[MISMATCH_LOOP_INC]], label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_inc:
				; LOOP-DEL-NEXT: [[TMP45]] = add i32 [[MISMATCH_INDEX]], 1
				; LOOP-DEL-NEXT: [[TMP46:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP46]], label [[WHILE_END]], label [[MISMATCH_LOOP]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[RES:%.*]] = phi i32 [ [[N]], [[ENTRY:%.*]] ], [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP38]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; LOOP-DEL-NEXT: ret i32 [[RES]]
				;
				entry:
				br label %ph

				ph:
				%start = call i32 @llvm.umin.i32(i32 %idx1, i32 %idx2)
				%ext = zext i32 %start to i64
				%a0 = getelementptr inbounds i8, ptr %a, i64 %ext
				%0 = load i8, ptr %a0, align 1
				%a1 = getelementptr inbounds i8, ptr %b, i64 %ext
				%1 = load i8, ptr %a1, align 1
				%cmp = icmp eq i8 %0, %1
				br i1 %cmp, label %while.cond.preheader, label %while.end

				while.cond.preheader:
				br label %while.cond

				while.cond:
				%len.phi = phi i32 [ %start, %while.cond.preheader ], [ %inc, %while.body ]
				%inc = add i32 %len.phi, 1
				%cmp.not = icmp eq i32 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%idx.a = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%2 = load i8, ptr %idx.a, align 1
				%idx.b = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%3 = load i8, ptr %idx.b, align 1
				%cmp.not2 = icmp eq i8 %2, %3
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%res = phi i32 [ %n, %ph], [ %inc, %while.cond], [ %inc, %while.body ]
				ret i32 %res
				}

				declare i32 @llvm.umin.i32(i32, i32);

				define i32 @compare_bytes_extra_cmp(ptr %a, ptr %b, i32 %len, i32 %n, i32 %x) {
				; CHECK-LABEL: define i32 @compare_bytes_extra_cmp(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]], i32 [[X:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[CMP_X:%.*]] = icmp ult i32 [[N]], [[X]]
				; CHECK-NEXT: br i1 [[CMP_X]], label [[PH:%.*]], label [[WHILE_END:%.*]]
				; CHECK: ph:
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[LEN]], 1
				; CHECK-NEXT: br label [[MISMATCH_MIN_IT_CHECK:%.*]]
				; CHECK: mismatch_min_it_check:
				; CHECK-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = zext i32 [[N]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = icmp ule i32 [[TMP0]], [[N]]
				; CHECK-NEXT: br i1 [[TMP3]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0]]
				; CHECK: mismatch_mem_check:
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP1]]
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP1]]
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
				; CHECK-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP4]] to i64
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP2]]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP2]]
				; CHECK-NEXT: [[TMP10:%.*]] = ptrtoint ptr [[TMP8]] to i64
				; CHECK-NEXT: [[TMP11:%.*]] = ptrtoint ptr [[TMP9]] to i64
				; CHECK-NEXT: [[TMP12:%.*]] = lshr i64 [[TMP7]], 12
				; CHECK-NEXT: [[TMP13:%.*]] = lshr i64 [[TMP10]], 12
				; CHECK-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP6]], 12
				; CHECK-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP11]], 12
				; CHECK-NEXT: [[TMP16:%.*]] = icmp ne i64 [[TMP12]], [[TMP13]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; CHECK-NEXT: [[TMP18:%.*]] = or i1 [[TMP16]], [[TMP17]]
				; CHECK-NEXT: br i1 [[TMP18]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1]]
				; CHECK: mismatch_sve_loop_preheader:
				; CHECK-NEXT: [[TMP19:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP1]], i64 [[TMP2]])
				; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP21:%.*]] = mul nuw nsw i64 [[TMP20]], 16
				; CHECK-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; CHECK: mismatch_sve_loop:
				; CHECK-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP19]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP30:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP1]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP29:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP23:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP22]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP26:%.*]] = icmp ne <vscale x 16 x i8> [[TMP23]], [[TMP25]]
				; CHECK-NEXT: [[TMP27:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP26]], <vscale x 16 x i1> zeroinitializer
				; CHECK-NEXT: [[TMP28:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP27]])
				; CHECK-NEXT: br i1 [[TMP28]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; CHECK: mismatch_sve_loop_inc:
				; CHECK-NEXT: [[TMP29]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP21]]
				; CHECK-NEXT: [[TMP30]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP29]], i64 [[TMP2]])
				; CHECK-NEXT: [[TMP31:%.*]] = extractelement <vscale x 16 x i1> [[TMP30]], i64 0
				; CHECK-NEXT: br i1 [[TMP31]], label [[MISMATCH_SVE_LOOP]], label [[MISMATCH_END:%.*]]
				; CHECK: mismatch_sve_loop_found:
				; CHECK-NEXT: [[TMP32:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP27]]
				; CHECK-NEXT: [[TMP33:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP32]], i32 1)
				; CHECK-NEXT: [[TMP34:%.*]] = zext i32 [[TMP33]] to i64
				; CHECK-NEXT: [[TMP35:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP34]]
				; CHECK-NEXT: [[TMP36:%.*]] = trunc i64 [[TMP35]] to i32
				; CHECK-NEXT: br label [[MISMATCH_END]]
				; CHECK: mismatch_loop_pre:
				; CHECK-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP0]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP0]], [[MISMATCH_MIN_IT_CHECK]] ]
				; CHECK-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; CHECK: mismatch_loop:
				; CHECK-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP43:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[TMP37:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; CHECK-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP37]]
				; CHECK-NEXT: [[TMP39:%.*]] = load i8, ptr [[TMP38]], align 1
				; CHECK-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP37]]
				; CHECK-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; CHECK-NEXT: [[TMP42:%.*]] = icmp eq i8 [[TMP39]], [[TMP41]]
				; CHECK-NEXT: br i1 [[TMP42]], label [[MISMATCH_LOOP_INC]], label [[MISMATCH_END]]
				; CHECK: mismatch_loop_inc:
				; CHECK-NEXT: [[TMP43]] = add i32 [[MISMATCH_INDEX]], 1
				; CHECK-NEXT: [[TMP44:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; CHECK-NEXT: br i1 [[TMP44]], label [[MISMATCH_END]], label [[MISMATCH_LOOP]]
				; CHECK: mismatch_end:
				; CHECK-NEXT: [[MISMATCH_RESULT:%.*]] = phi i32 [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP36]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; CHECK-NEXT: br i1 true, label [[BYTE_COMPARE:%.*]], label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[MISMATCH_END]] ], [ [[MISMATCH_RESULT]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC:%.*]] = add i32 [[MISMATCH_RESULT]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[MISMATCH_RESULT]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP45:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP46:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP45]], [[TMP46]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: byte.compare:
				; CHECK-NEXT: [[TMP47:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], [[N]]
				; CHECK-NEXT: br i1 [[TMP47]], label [[WHILE_END]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[MISMATCH_RESULT]], [[WHILE_BODY]] ], [ [[MISMATCH_RESULT]], [[WHILE_COND]] ], [ [[X]], [[ENTRY:%.*]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ]
				; CHECK-NEXT: ret i32 [[INC_LCSSA]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_extra_cmp(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]], i32 [[X:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: [[CMP_X:%.*]] = icmp ult i32 [[N]], [[X]]
				; LOOP-DEL-NEXT: br i1 [[CMP_X]], label [[PH:%.*]], label [[WHILE_END:%.*]]
				; LOOP-DEL: ph:
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = add i32 [[LEN]], 1
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
				; LOOP-DEL-NEXT: [[TMP2:%.*]] = zext i32 [[N]] to i64
				; LOOP-DEL-NEXT: [[TMP3:%.*]] = icmp ule i32 [[TMP0]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP3]], label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0]]
				; LOOP-DEL: mismatch_mem_check:
				; LOOP-DEL-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP1]]
				; LOOP-DEL-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP1]]
				; LOOP-DEL-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP5]] to i64
				; LOOP-DEL-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP4]] to i64
				; LOOP-DEL-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP2]]
				; LOOP-DEL-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP2]]
				; LOOP-DEL-NEXT: [[TMP10:%.*]] = ptrtoint ptr [[TMP8]] to i64
				; LOOP-DEL-NEXT: [[TMP11:%.*]] = ptrtoint ptr [[TMP9]] to i64
				; LOOP-DEL-NEXT: [[TMP12:%.*]] = lshr i64 [[TMP7]], 12
				; LOOP-DEL-NEXT: [[TMP13:%.*]] = lshr i64 [[TMP10]], 12
				; LOOP-DEL-NEXT: [[TMP14:%.*]] = lshr i64 [[TMP6]], 12
				; LOOP-DEL-NEXT: [[TMP15:%.*]] = lshr i64 [[TMP11]], 12
				; LOOP-DEL-NEXT: [[TMP16:%.*]] = icmp ne i64 [[TMP12]], [[TMP13]]
				; LOOP-DEL-NEXT: [[TMP17:%.*]] = icmp ne i64 [[TMP14]], [[TMP15]]
				; LOOP-DEL-NEXT: [[TMP18:%.*]] = or i1 [[TMP16]], [[TMP17]]
				; LOOP-DEL-NEXT: br i1 [[TMP18]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1]]
				; LOOP-DEL: mismatch_sve_loop_preheader:
				; LOOP-DEL-NEXT: [[TMP19:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP1]], i64 [[TMP2]])
				; LOOP-DEL-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
				; LOOP-DEL-NEXT: [[TMP21:%.*]] = mul nuw nsw i64 [[TMP20]], 16
				; LOOP-DEL-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; LOOP-DEL: mismatch_sve_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP19]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP30:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ [[TMP1]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP29:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; LOOP-DEL-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP23:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP22]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP24:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[MISMATCH_SVE_INDEX]]
				; LOOP-DEL-NEXT: [[TMP25:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP24]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; LOOP-DEL-NEXT: [[TMP26:%.*]] = icmp ne <vscale x 16 x i8> [[TMP23]], [[TMP25]]
				; LOOP-DEL-NEXT: [[TMP27:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP26]], <vscale x 16 x i1> zeroinitializer
				; LOOP-DEL-NEXT: [[TMP28:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP27]])
				; LOOP-DEL-NEXT: br i1 [[TMP28]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; LOOP-DEL: mismatch_sve_loop_inc:
				; LOOP-DEL-NEXT: [[TMP29]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP21]]
				; LOOP-DEL-NEXT: [[TMP30]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP29]], i64 [[TMP2]])
				; LOOP-DEL-NEXT: [[TMP31:%.*]] = extractelement <vscale x 16 x i1> [[TMP30]], i64 0
				; LOOP-DEL-NEXT: br i1 [[TMP31]], label [[MISMATCH_SVE_LOOP]], label [[WHILE_END]]
				; LOOP-DEL: mismatch_sve_loop_found:
				; LOOP-DEL-NEXT: [[TMP32:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP27]]
				; LOOP-DEL-NEXT: [[TMP33:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP32]], i32 1)
				; LOOP-DEL-NEXT: [[TMP34:%.*]] = zext i32 [[TMP33]] to i64
				; LOOP-DEL-NEXT: [[TMP35:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP34]]
				; LOOP-DEL-NEXT: [[TMP36:%.*]] = trunc i64 [[TMP35]] to i32
				; LOOP-DEL-NEXT: br label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_pre:
				; LOOP-DEL-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ [[TMP0]], [[MISMATCH_MEM_CHECK]] ], [ [[TMP0]], [[PH]] ]
				; LOOP-DEL-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; LOOP-DEL: mismatch_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP43:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; LOOP-DEL-NEXT: [[TMP37:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; LOOP-DEL-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP37]]
				; LOOP-DEL-NEXT: [[TMP39:%.*]] = load i8, ptr [[TMP38]], align 1
				; LOOP-DEL-NEXT: [[TMP40:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[TMP37]]
				; LOOP-DEL-NEXT: [[TMP41:%.*]] = load i8, ptr [[TMP40]], align 1
				; LOOP-DEL-NEXT: [[TMP42:%.*]] = icmp eq i8 [[TMP39]], [[TMP41]]
				; LOOP-DEL-NEXT: br i1 [[TMP42]], label [[MISMATCH_LOOP_INC]], label [[WHILE_END]]
				; LOOP-DEL: mismatch_loop_inc:
				; LOOP-DEL-NEXT: [[TMP43]] = add i32 [[MISMATCH_INDEX]], 1
				; LOOP-DEL-NEXT: [[TMP44:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[TMP44]], label [[WHILE_END]], label [[MISMATCH_LOOP]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[X]], [[ENTRY:%.*]] ], [ [[N]], [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ [[N]], [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP36]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; LOOP-DEL-NEXT: ret i32 [[INC_LCSSA]]
				;
				entry:
				%cmp.x = icmp ult i32 %n, %x
				br i1 %cmp.x, label %ph, label %while.end

				ph:
				br label %while.cond

				while.cond:
				%len.addr = phi i32 [ %len, %ph ], [ %inc, %while.body ]
				%inc = add i32 %len.addr, 1
				%cmp.not = icmp eq i32 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%inc.lcssa = phi i32 [ %inc, %while.body ], [ %inc, %while.cond ], [ %x, %entry ]
				ret i32 %inc.lcssa
				}

				define void @compare_bytes_cleanup_block(ptr %src1, ptr %src2) {
				; CHECK-LABEL: define void @compare_bytes_cleanup_block(
				; CHECK-SAME: ptr [[SRC1:%.*]], ptr [[SRC2:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[MISMATCH_MIN_IT_CHECK:%.*]]
				; CHECK: mismatch_min_it_check:
				; CHECK-NEXT: br i1 false, label [[MISMATCH_MEM_CHECK:%.*]], label [[MISMATCH_LOOP_PRE:%.*]], !prof [[PROF0]]
				; CHECK: mismatch_mem_check:
				; CHECK-NEXT: [[TMP0:%.*]] = getelementptr i8, ptr [[SRC1]], i64 1
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[SRC2]], i64 1
				; CHECK-NEXT: [[TMP2:%.*]] = ptrtoint ptr [[TMP1]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = ptrtoint ptr [[TMP0]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[SRC1]], i64 0
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[SRC2]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = ptrtoint ptr [[TMP4]] to i64
				; CHECK-NEXT: [[TMP7:%.*]] = ptrtoint ptr [[TMP5]] to i64
				; CHECK-NEXT: [[TMP8:%.*]] = lshr i64 [[TMP3]], 12
				; CHECK-NEXT: [[TMP9:%.*]] = lshr i64 [[TMP6]], 12
				; CHECK-NEXT: [[TMP10:%.*]] = lshr i64 [[TMP2]], 12
				; CHECK-NEXT: [[TMP11:%.*]] = lshr i64 [[TMP7]], 12
				; CHECK-NEXT: [[TMP12:%.*]] = icmp ne i64 [[TMP8]], [[TMP9]]
				; CHECK-NEXT: [[TMP13:%.*]] = icmp ne i64 [[TMP10]], [[TMP11]]
				; CHECK-NEXT: [[TMP14:%.*]] = or i1 [[TMP12]], [[TMP13]]
				; CHECK-NEXT: br i1 [[TMP14]], label [[MISMATCH_LOOP_PRE]], label [[MISMATCH_SVE_LOOP_PREHEADER:%.*]], !prof [[PROF1]]
				; CHECK: mismatch_sve_loop_preheader:
				; CHECK-NEXT: [[TMP15:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 1, i64 0)
				; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP17:%.*]] = mul nuw nsw i64 [[TMP16]], 16
				; CHECK-NEXT: br label [[MISMATCH_SVE_LOOP:%.*]]
				; CHECK: mismatch_sve_loop:
				; CHECK-NEXT: [[MISMATCH_SVE_LOOP_PRED:%.*]] = phi <vscale x 16 x i1> [ [[TMP15]], [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP26:%.*]], [[MISMATCH_SVE_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[MISMATCH_SVE_INDEX:%.*]] = phi i64 [ 1, [[MISMATCH_SVE_LOOP_PREHEADER]] ], [ [[TMP25:%.*]], [[MISMATCH_SVE_LOOP_INC]] ]
				; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[SRC1]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP19:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP18]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds i8, ptr [[SRC2]], i64 [[MISMATCH_SVE_INDEX]]
				; CHECK-NEXT: [[TMP21:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP20]], i32 1, <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i8> zeroinitializer)
				; CHECK-NEXT: [[TMP22:%.*]] = icmp ne <vscale x 16 x i8> [[TMP19]], [[TMP21]]
				; CHECK-NEXT: [[TMP23:%.*]] = select <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], <vscale x 16 x i1> [[TMP22]], <vscale x 16 x i1> zeroinitializer
				; CHECK-NEXT: [[TMP24:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP23]])
				; CHECK-NEXT: br i1 [[TMP24]], label [[MISMATCH_SVE_LOOP_FOUND:%.*]], label [[MISMATCH_SVE_LOOP_INC]]
				; CHECK: mismatch_sve_loop_inc:
				; CHECK-NEXT: [[TMP25]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP17]]
				; CHECK-NEXT: [[TMP26]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP25]], i64 0)
				; CHECK-NEXT: [[TMP27:%.*]] = extractelement <vscale x 16 x i1> [[TMP26]], i64 0
				; CHECK-NEXT: br i1 [[TMP27]], label [[MISMATCH_SVE_LOOP]], label [[MISMATCH_END:%.*]]
				; CHECK: mismatch_sve_loop_found:
				; CHECK-NEXT: [[TMP28:%.*]] = and <vscale x 16 x i1> [[MISMATCH_SVE_LOOP_PRED]], [[TMP23]]
				; CHECK-NEXT: [[TMP29:%.*]] = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> [[TMP28]], i32 1)
				; CHECK-NEXT: [[TMP30:%.*]] = zext i32 [[TMP29]] to i64
				; CHECK-NEXT: [[TMP31:%.*]] = add nuw nsw i64 [[MISMATCH_SVE_INDEX]], [[TMP30]]
				; CHECK-NEXT: [[TMP32:%.*]] = trunc i64 [[TMP31]] to i32
				; CHECK-NEXT: br label [[MISMATCH_END]]
				; CHECK: mismatch_loop_pre:
				; CHECK-NEXT: [[MISMATCH_START_INDEX:%.*]] = phi i32 [ 1, [[MISMATCH_MEM_CHECK]] ], [ 1, [[MISMATCH_MIN_IT_CHECK]] ]
				; CHECK-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; CHECK: mismatch_loop:
				; CHECK-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ [[MISMATCH_START_INDEX]], [[MISMATCH_LOOP_PRE]] ], [ [[TMP39:%.*]], [[MISMATCH_LOOP_INC:%.*]] ]
				; CHECK-NEXT: [[TMP33:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; CHECK-NEXT: [[TMP34:%.*]] = getelementptr inbounds i8, ptr [[SRC1]], i64 [[TMP33]]
				; CHECK-NEXT: [[TMP35:%.*]] = load i8, ptr [[TMP34]], align 1
				; CHECK-NEXT: [[TMP36:%.*]] = getelementptr inbounds i8, ptr [[SRC2]], i64 [[TMP33]]
				; CHECK-NEXT: [[TMP37:%.*]] = load i8, ptr [[TMP36]], align 1
				; CHECK-NEXT: [[TMP38:%.*]] = icmp eq i8 [[TMP35]], [[TMP37]]
				; CHECK-NEXT: br i1 [[TMP38]], label [[MISMATCH_LOOP_INC]], label [[MISMATCH_END]]
				; CHECK: mismatch_loop_inc:
				; CHECK-NEXT: [[TMP39]] = add i32 [[MISMATCH_INDEX]], 1
				; CHECK-NEXT: [[TMP40:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], 0
				; CHECK-NEXT: br i1 [[TMP40]], label [[MISMATCH_END]], label [[MISMATCH_LOOP]]
				; CHECK: mismatch_end:
				; CHECK-NEXT: [[MISMATCH_RESULT:%.*]] = phi i32 [ 0, [[MISMATCH_LOOP_INC]] ], [ [[MISMATCH_INDEX]], [[MISMATCH_LOOP]] ], [ 0, [[MISMATCH_SVE_LOOP_INC]] ], [ [[TMP32]], [[MISMATCH_SVE_LOOP_FOUND]] ]
				; CHECK-NEXT: br i1 true, label [[BYTE_COMPARE:%.*]], label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN:%.*]] = phi i32 [ [[MISMATCH_RESULT]], [[WHILE_BODY:%.*]] ], [ 0, [[MISMATCH_END]] ]
				; CHECK-NEXT: [[INC:%.*]] = add i32 [[MISMATCH_RESULT]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], 0
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[CLEANUP_THREAD:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[MISMATCH_RESULT]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr i8, ptr [[SRC1]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP41:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr i8, ptr [[SRC2]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP42:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP41]], [[TMP42]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[IF_END:%.*]]
				; CHECK: byte.compare:
				; CHECK-NEXT: [[TMP43:%.*]] = icmp eq i32 [[MISMATCH_RESULT]], 0
				; CHECK-NEXT: br i1 [[TMP43]], label [[CLEANUP_THREAD]], label [[IF_END]]
				; CHECK: cleanup.thread:
				; CHECK-NEXT: ret void
				; CHECK: if.end:
				; CHECK-NEXT: [[RES:%.*]] = phi i32 [ [[MISMATCH_RESULT]], [[WHILE_BODY]] ], [ [[MISMATCH_RESULT]], [[BYTE_COMPARE]] ]
				; CHECK-NEXT: ret void
				;
				; LOOP-DEL-LABEL: define void @compare_bytes_cleanup_block(
				; LOOP-DEL-SAME: ptr [[SRC1:%.*]], ptr [[SRC2:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: br label [[MISMATCH_LOOP:%.*]]
				; LOOP-DEL: mismatch_loop:
				; LOOP-DEL-NEXT: [[MISMATCH_INDEX:%.*]] = phi i32 [ 1, [[ENTRY:%.*]] ], [ [[TMP6:%.*]], [[MISMATCH_LOOP]] ]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = zext i32 [[MISMATCH_INDEX]] to i64
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC1]], i64 [[TMP0]]
				; LOOP-DEL-NEXT: [[TMP2:%.*]] = load i8, ptr [[TMP1]], align 1
				; LOOP-DEL-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[SRC2]], i64 [[TMP0]]
				; LOOP-DEL-NEXT: [[TMP4:%.*]] = load i8, ptr [[TMP3]], align 1
				; LOOP-DEL-NEXT: [[TMP5:%.*]] = icmp ne i8 [[TMP2]], [[TMP4]]
				; LOOP-DEL-NEXT: [[TMP6]] = add i32 [[MISMATCH_INDEX]], 1
				; LOOP-DEL-NEXT: [[TMP7:%.*]] = icmp eq i32 [[MISMATCH_INDEX]], 0
				; LOOP-DEL-NEXT: [[OR_COND:%.*]] = or i1 [[TMP5]], [[TMP7]]
				; LOOP-DEL-NEXT: br i1 [[OR_COND]], label [[COMMON_RET:%.*]], label [[MISMATCH_LOOP]]
				; LOOP-DEL: common.ret:
				; LOOP-DEL-NEXT: ret void
				;
				entry:
				br label %while.cond

				while.cond:
				%len = phi i32 [ %inc, %while.body ], [ 0, %entry ]
				%inc = add i32 %len, 1
				%cmp.not = icmp eq i32 %inc, 0
				br i1 %cmp.not, label %cleanup.thread, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%arrayidx = getelementptr i8, ptr %src1, i64 %idxprom
				%0 = load i8, ptr %arrayidx, align 1
				%arrayidx2 = getelementptr i8, ptr %src2, i64 %idxprom
				%1 = load i8, ptr %arrayidx2, align 1
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %if.end

				cleanup.thread:
				ret void

				if.end:
				%res = phi i32 [ %len, %while.body ]
				ret void
				}

				;
				; NEGATIVE TESTS
				;

				define i32 @compare_bytes_sign_ext(ptr %a, ptr %b, i32 %len, i32 %n) {
				; CHECK-LABEL: define i32 @compare_bytes_sign_ext(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC]] = add i32 [[LEN_ADDR]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = sext i32 [[INC]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; CHECK-NEXT: ret i32 [[INC_LCSSA]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_sign_ext(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: br label [[WHILE_COND:%.*]]
				; LOOP-DEL: while.cond:
				; LOOP-DEL-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; LOOP-DEL-NEXT: [[INC]] = add i32 [[LEN_ADDR]], 1
				; LOOP-DEL-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; LOOP-DEL: while.body:
				; LOOP-DEL-NEXT: [[IDXPROM:%.*]] = sext i32 [[INC]] to i64
				; LOOP-DEL-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; LOOP-DEL-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; LOOP-DEL-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; LOOP-DEL-NEXT: ret i32 [[INC_LCSSA]]
				;
				entry:
				br label %while.cond

				while.cond:
				%len.addr = phi i32 [ %len, %entry ], [ %inc, %while.body ]
				%inc = add i32 %len.addr, 1
				%cmp.not = icmp eq i32 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = sext i32 %inc to i64
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%inc.lcssa = phi i32 [ %inc, %while.body ], [ %inc, %while.cond ]
				ret i32 %inc.lcssa
				}

				define i32 @compare_bytes_signed_wrap(ptr %a, ptr %b, i32 %len, i32 %n) {
				; CHECK-LABEL: define i32 @compare_bytes_signed_wrap(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC]] = add nsw i32 [[LEN_ADDR]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[INC]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; CHECK-NEXT: ret i32 [[INC_LCSSA]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_signed_wrap(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: br label [[WHILE_COND:%.*]]
				; LOOP-DEL: while.cond:
				; LOOP-DEL-NEXT: [[LEN_ADDR:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; LOOP-DEL-NEXT: [[INC]] = add nsw i32 [[LEN_ADDR]], 1
				; LOOP-DEL-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; LOOP-DEL: while.body:
				; LOOP-DEL-NEXT: [[IDXPROM:%.*]] = zext i32 [[INC]] to i64
				; LOOP-DEL-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; LOOP-DEL-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; LOOP-DEL-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[INC_LCSSA:%.*]] = phi i32 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; LOOP-DEL-NEXT: ret i32 [[INC_LCSSA]]
				;
				entry:
				br label %while.cond

				while.cond:
				%len.addr = phi i32 [ %len, %entry ], [ %inc, %while.body ]
				%inc = add nsw i32 %len.addr, 1
				%cmp.not = icmp eq i32 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%inc.lcssa = phi i32 [ %inc, %while.body ], [ %inc, %while.cond ]
				ret i32 %inc.lcssa
				}

				define i32 @compare_bytes_outside_uses(ptr %a, ptr %b, i32 %len, i32 %n) {
				; CHECK-LABEL: define i32 @compare_bytes_outside_uses(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC]] = add i32 [[IV]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[LEN]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[INC]] to i64
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[RES:%.*]] = phi i1 [ [[CMP_NOT]], [[WHILE_BODY]] ], [ [[CMP_NOT]], [[WHILE_COND]] ]
				; CHECK-NEXT: [[EXT_RES:%.*]] = zext i1 [[RES]] to i32
				; CHECK-NEXT: ret i32 [[EXT_RES]]
				;
				; LOOP-DEL-LABEL: define i32 @compare_bytes_outside_uses(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i32 [[LEN:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: br label [[WHILE_COND:%.*]]
				; LOOP-DEL: while.cond:
				; LOOP-DEL-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; LOOP-DEL-NEXT: [[INC]] = add i32 [[IV]], 1
				; LOOP-DEL-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[INC]], [[LEN]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; LOOP-DEL: while.body:
				; LOOP-DEL-NEXT: [[IDXPROM:%.*]] = zext i32 [[INC]] to i64
				; LOOP-DEL-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; LOOP-DEL-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[IDXPROM]]
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; LOOP-DEL-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[RES:%.*]] = phi i1 [ [[CMP_NOT]], [[WHILE_BODY]] ], [ [[CMP_NOT]], [[WHILE_COND]] ]
				; LOOP-DEL-NEXT: [[EXT_RES:%.*]] = zext i1 [[RES]] to i32
				; LOOP-DEL-NEXT: ret i32 [[EXT_RES]]
				;
				entry:
				br label %while.cond

				while.cond:
				%iv = phi i32 [ 0, %entry ], [ %inc, %while.body ]
				%inc = add i32 %iv, 1
				%cmp.not = icmp eq i32 %inc, %len
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%idxprom = zext i32 %inc to i64
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %idxprom
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%res = phi i1 [ %cmp.not, %while.body ], [ %cmp.not, %while.cond ]
				%ext_res = zext i1 %res to i32
				ret i32 %ext_res
				}

				define i64 @compare_bytes_i64_index(ptr %a, ptr %b, i64 %len, i64 %n) {
				; CHECK-LABEL: define i64 @compare_bytes_i64_index(
				; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i64 [[LEN:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[WHILE_COND:%.*]]
				; CHECK: while.cond:
				; CHECK-NEXT: [[LEN_ADDR:%.*]] = phi i64 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; CHECK-NEXT: [[INC]] = add i64 [[LEN_ADDR]], 1
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i64 [[INC]], [[N]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; CHECK: while.body:
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INC]]
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INC]]
				; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; CHECK-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; CHECK-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; CHECK: while.end:
				; CHECK-NEXT: [[INC_LCSSA:%.*]] = phi i64 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; CHECK-NEXT: ret i64 [[INC_LCSSA]]
				;
				; LOOP-DEL-LABEL: define i64 @compare_bytes_i64_index(
				; LOOP-DEL-SAME: ptr [[A:%.*]], ptr [[B:%.*]], i64 [[LEN:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
				; LOOP-DEL-NEXT: entry:
				; LOOP-DEL-NEXT: br label [[WHILE_COND:%.*]]
				; LOOP-DEL: while.cond:
				; LOOP-DEL-NEXT: [[LEN_ADDR:%.*]] = phi i64 [ [[LEN]], [[ENTRY:%.*]] ], [ [[INC:%.*]], [[WHILE_BODY:%.*]] ]
				; LOOP-DEL-NEXT: [[INC]] = add i64 [[LEN_ADDR]], 1
				; LOOP-DEL-NEXT: [[CMP_NOT:%.*]] = icmp eq i64 [[INC]], [[N]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY]]
				; LOOP-DEL: while.body:
				; LOOP-DEL-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INC]]
				; LOOP-DEL-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX]], align 1
				; LOOP-DEL-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INC]]
				; LOOP-DEL-NEXT: [[TMP1:%.*]] = load i8, ptr [[ARRAYIDX2]], align 1
				; LOOP-DEL-NEXT: [[CMP_NOT2:%.*]] = icmp eq i8 [[TMP0]], [[TMP1]]
				; LOOP-DEL-NEXT: br i1 [[CMP_NOT2]], label [[WHILE_COND]], label [[WHILE_END]]
				; LOOP-DEL: while.end:
				; LOOP-DEL-NEXT: [[INC_LCSSA:%.*]] = phi i64 [ [[INC]], [[WHILE_BODY]] ], [ [[INC]], [[WHILE_COND]] ]
				; LOOP-DEL-NEXT: ret i64 [[INC_LCSSA]]
				;
				entry:
				br label %while.cond

				while.cond:
				%len.addr = phi i64 [ %len, %entry ], [ %inc, %while.body ]
				%inc = add i64 %len.addr, 1
				%cmp.not = icmp eq i64 %inc, %n
				br i1 %cmp.not, label %while.end, label %while.body

				while.body:
				%arrayidx = getelementptr inbounds i8, ptr %a, i64 %inc
				%0 = load i8, ptr %arrayidx
				%arrayidx2 = getelementptr inbounds i8, ptr %b, i64 %inc
				%1 = load i8, ptr %arrayidx2
				%cmp.not2 = icmp eq i8 %0, %1
				br i1 %cmp.not2, label %while.cond, label %while.end

				while.end:
				%inc.lcssa = phi i64 [ %inc, %while.body ], [ %inc, %while.cond ]
				ret i64 %inc.lcssa
				}
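
For readers who find the IR hard to scan, the following is a rough C rendering of the `@compare_bytes_i64_index` input above. It is only a sketch written from the IR: the source the test was derived from is not shown in this diff, so the parameter types and the pre-increment form are assumptions. Note that in both the CHECK and LOOP-DEL expectations the loop appears in the same scalar form as the input.

```c
#include <stdint.h>

/* Hypothetical C equivalent of @compare_bytes_i64_index, reconstructed
 * from the IR in this test; not taken from the patch itself. */
uint64_t compare_bytes_i64_index(const uint8_t *a, const uint8_t *b,
                                 uint64_t len, uint64_t n) {
    /* while.cond: increment the index first, then test it against n. */
    while (++len != n) {
        /* while.body: leave the loop at the first mismatching byte pair. */
        if (a[len] != b[len])
            break;
    }
    /* while.end: both exits return the incremented index. */
    return len;
}
```

Because the index is incremented before the first comparison, the byte at the starting value of `len` is never inspected; the search begins at `len + 1`.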

llvm/test/Transforms/PhaseOrdering/ARM/arm_mean_q7.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -passes='default<O3>' -S | FileCheck %s			; RUN: opt < %s -passes='default<O3>' -S | FileCheck %s

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv6m-none-none-eabi"			target triple = "thumbv6m-none-none-eabi"

	; Make sure we don't make a mess of vectorization/unrolling of the remainder loop.			; Make sure we don't make a mess of vectorization/unrolling of the remainder loop.

	define void @arm_mean_q7(ptr noundef %pSrc, i32 noundef %blockSize, ptr noundef %pResult) #0 {			define void @arm_mean_q7(ptr noundef %pSrc, i32 noundef %blockSize, ptr noundef %pResult) #0 {
	; CHECK-LABEL: @arm_mean_q7(			; CHECK-LABEL: @arm_mean_q7(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP_NOT10:%.*]] = icmp ult i32 [[BLOCKSIZE:%.*]], 16			; CHECK-NEXT: [[CMP_NOT10:%.*]] = icmp ult i32 [[BLOCKSIZE:%.*]], 16
	; CHECK-NEXT: br i1 [[CMP_NOT10]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]			; CHECK-NEXT: br i1 [[CMP_NOT10]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]
	; CHECK: while.body.preheader:			; CHECK: while.body.preheader:
	; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[BLOCKSIZE]], 4			; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[BLOCKSIZE]], 4
	; CHECK-NEXT: [[TMP0:%.*]] = and i32 [[BLOCKSIZE]], -16
	; CHECK-NEXT: br label [[WHILE_BODY:%.*]]			; CHECK-NEXT: br label [[WHILE_BODY:%.*]]
	; CHECK: while.body:			; CHECK: while.body:
	; CHECK-NEXT: [[SUM_013:%.*]] = phi i32 [ [[TMP3:%.*]], [[WHILE_BODY]] ], [ 0, [[WHILE_BODY_PREHEADER]] ]			; CHECK-NEXT: [[SUM_013:%.*]] = phi i32 [ [[TMP2:%.*]], [[WHILE_BODY]] ], [ 0, [[WHILE_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[PSRC_ADDR_012:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[WHILE_BODY]] ], [ [[PSRC:%.*]], [[WHILE_BODY_PREHEADER]] ]			; CHECK-NEXT: [[PSRC_ADDR_012:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[WHILE_BODY]] ], [ [[PSRC:%.*]], [[WHILE_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[BLKCNT_011:%.*]] = phi i32 [ [[DEC:%.*]], [[WHILE_BODY]] ], [ [[SHR]], [[WHILE_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BLKCNT_011:%.*]] = phi i32 [ [[DEC:%.*]], [[WHILE_BODY]] ], [ [[SHR]], [[WHILE_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[TMP1:%.*]] = load <16 x i8>, ptr [[PSRC_ADDR_012]], align 1			; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, ptr [[PSRC_ADDR_012]], align 1
	; CHECK-NEXT: [[TMP2:%.*]] = tail call i32 @llvm.arm.mve.addv.v16i8(<16 x i8> [[TMP1]], i32 0)			; CHECK-NEXT: [[TMP1:%.*]] = tail call i32 @llvm.arm.mve.addv.v16i8(<16 x i8> [[TMP0]], i32 0)
	; CHECK-NEXT: [[TMP3]] = add i32 [[TMP2]], [[SUM_013]]			; CHECK-NEXT: [[TMP2]] = add i32 [[TMP1]], [[SUM_013]]
	; CHECK-NEXT: [[DEC]] = add nsw i32 [[BLKCNT_011]], -1			; CHECK-NEXT: [[DEC]] = add nsw i32 [[BLKCNT_011]], -1
	; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds i8, ptr [[PSRC_ADDR_012]], i32 16			; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds i8, ptr [[PSRC_ADDR_012]], i32 16
	; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[DEC]], 0			; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[DEC]], 0
	; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END_LOOPEXIT:%.*]], label [[WHILE_BODY]]			; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END_LOOPEXIT:%.*]], label [[WHILE_BODY]]
	; CHECK: while.end.loopexit:			; CHECK: while.end.loopexit:
	; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[PSRC]], i32 [[TMP0]]			; CHECK-NEXT: [[TMP3:%.*]] = and i32 [[BLOCKSIZE]], -16
				; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[PSRC]], i32 [[TMP3]]
	; CHECK-NEXT: br label [[WHILE_END]]			; CHECK-NEXT: br label [[WHILE_END]]
	; CHECK: while.end:			; CHECK: while.end:
	; CHECK-NEXT: [[PSRC_ADDR_0_LCSSA:%.*]] = phi ptr [ [[PSRC]], [[ENTRY:%.*]] ], [ [[SCEVGEP]], [[WHILE_END_LOOPEXIT]] ]			; CHECK-NEXT: [[PSRC_ADDR_0_LCSSA:%.*]] = phi ptr [ [[PSRC]], [[ENTRY:%.*]] ], [ [[SCEVGEP]], [[WHILE_END_LOOPEXIT]] ]
	; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP3]], [[WHILE_END_LOOPEXIT]] ]			; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[TMP2]], [[WHILE_END_LOOPEXIT]] ]
	; CHECK-NEXT: [[AND:%.*]] = and i32 [[BLOCKSIZE]], 15			; CHECK-NEXT: [[AND:%.*]] = and i32 [[BLOCKSIZE]], 15
	; CHECK-NEXT: [[CMP2_NOT15:%.*]] = icmp eq i32 [[AND]], 0			; CHECK-NEXT: [[CMP2_NOT15:%.*]] = icmp eq i32 [[AND]], 0
	; CHECK-NEXT: br i1 [[CMP2_NOT15]], label [[WHILE_END5:%.*]], label [[MIDDLE_BLOCK:%.*]]			; CHECK-NEXT: br i1 [[CMP2_NOT15]], label [[WHILE_END5:%.*]], label [[MIDDLE_BLOCK:%.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = tail call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 [[AND]])			; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = tail call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 [[AND]])
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[PSRC_ADDR_0_LCSSA]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[PSRC_ADDR_0_LCSSA]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
	; CHECK-NEXT: [[TMP4:%.*]] = sext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>			; CHECK-NEXT: [[TMP4:%.*]] = sext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
	; CHECK-NEXT: [[TMP5:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP4]], <16 x i32> zeroinitializer			; CHECK-NEXT: [[TMP5:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP4]], <16 x i32> zeroinitializer

llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn

sources = [
"AArch64FalkorHWPFFix.cpp",		"AArch64FalkorHWPFFix.cpp",
"AArch64FastISel.cpp",		"AArch64FastISel.cpp",
"AArch64FrameLowering.cpp",		"AArch64FrameLowering.cpp",
"AArch64GlobalsTagging.cpp",		"AArch64GlobalsTagging.cpp",
"AArch64ISelDAGToDAG.cpp",		"AArch64ISelDAGToDAG.cpp",
"AArch64ISelLowering.cpp",		"AArch64ISelLowering.cpp",
"AArch64InstrInfo.cpp",		"AArch64InstrInfo.cpp",
"AArch64LoadStoreOptimizer.cpp",		"AArch64LoadStoreOptimizer.cpp",
		"AArch64LoopIdiomRecognize.cpp",
"AArch64LowerHomogeneousPrologEpilog.cpp",		"AArch64LowerHomogeneousPrologEpilog.cpp",
"AArch64MCInstLower.cpp",		"AArch64MCInstLower.cpp",
"AArch64MIPeepholeOpt.cpp",		"AArch64MIPeepholeOpt.cpp",
"AArch64MachineFunctionInfo.cpp",		"AArch64MachineFunctionInfo.cpp",
"AArch64MachineScheduler.cpp",		"AArch64MachineScheduler.cpp",
"AArch64MacroFusion.cpp",		"AArch64MacroFusion.cpp",
"AArch64PBQPRegAlloc.cpp",		"AArch64PBQPRegAlloc.cpp",
"AArch64PromoteConstant.cpp",		"AArch64PromoteConstant.cpp",

Revision Contents

Diff 556134
