This is an archive of the discontinued LLVM Phabricator instance.

Differential D106166

[LV][ARM] Tighten up MLA reduction costing
Closed, Public

Authored by dmgreen on Jul 16 2021, 10:37 AM.


Details

Reviewers

SjoerdMeijer
samtebbs
spatel
RKSimon

Commits

rG41cedb1c9a38: [LV][ARM] Tighten up MLA reduction costing

Summary

This makes a couple of changes to the costing of MLA reduction patterns, to more accurately cost various patterns that can come up from vectorization.

The Arm implementation of getExtendedAddReductionCost now only provides costs for legal or smaller input types. Larger than legal types need to be split, which currently does not work very well, especially for predicated reductions where the predicate may be legal but needs to be split.
getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))) is a pattern that can come up, and that it can be treated the same as reduce(mul(ext, ext)) provided the extension types match.
It has also been adjusted to not count the ext in reduce(mul(ext, ext)) as part of a reduce(mul) pattern.

Together these changes help to more accurately cost the MLA reductions in cases such as where the extend types don't match or the extend opcodes are different, picking better vectorization factors that don't result in expanded reductions.
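For orientation, the kinds of scalar loops that lead to these patterns can be sketched as below. This is only an illustration: the function names and the exact C++ source are assumptions, since the tests in this revision only show IR. The first loop widens twice (u8 to 32-bit for the multiply, then to 64-bit for the accumulator), which corresponds to the reduce(ext(mul(ext, ext))) shape. The second mixes a zero-extended and a sign-extended operand, so its two extend opcodes differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Double-extended MLA reduction: each u8 operand is widened to 32 bits for
// the multiply, and the product is widened again for the 64-bit accumulator.
int64_t mla_u8_to_i64(const uint8_t *x, const uint8_t *y, size_t n) {
  assert((n == 0 || (x && y)) && "non-empty input needs valid pointers");
  int64_t r = 0;
  for (size_t i = 0; i < n; ++i) {
    int32_t product = static_cast<int32_t>(x[i]) * static_cast<int32_t>(y[i]);
    r += static_cast<int64_t>(product);
  }
  return r;
}

// Mixed-sign MLA reduction: one operand is zero-extended and the other is
// sign-extended, so the two inner extends do not use the same opcode.
int32_t mla_u8_s8_to_i32(const uint8_t *a, const int8_t *b, size_t n) {
  assert((n == 0 || (a && b)) && "non-empty input needs valid pointers");
  int32_t s = 0;
  for (size_t i = 0; i < n; ++i)
    s += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return s;
}
```

In the first loop both operands use the same extend, which is the case the new pattern is meant to cover; the second loop is the kind of case where a fused MLA reduction should not be assumed.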

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Jul 16 2021, 10:37 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Jul 16 2021, 10:37 AM

dmgreen requested review of this revision. Jul 16 2021, 10:37 AM

Herald added a project: Restricted Project. Jul 16 2021, 10:37 AM

Harbormaster completed remote builds in B114547: Diff 359315. Jul 16 2021, 10:37 AM

RKSimon added inline comments. Jul 20 2021, 8:29 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7168	Pre-commit these PatternMatch NFCs?

dmgreen mentioned this in rG72dc5cab4f8b: [LV] Make use of PatternMatchers in getReductionPatternCost. NFC. Jul 21 2021, 3:34 AM

That does sound like a good suggestion. Split off the NFC pattern match code.

Harbormaster completed remote builds in B115276: Diff 360396. Jul 21 2021, 3:36 AM

SjoerdMeijer added inline comments. Jul 28 2021, 12:51 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7216	Do we also need to match `Op1`, i.e. match(Op1, m_ZExtOrSExt(m_Value()))? That's what I would guess from reading the comment below.
7221	Nit: might be easier to read if this comment is just before the if.
7235	Just a quick query on the `* 2`: I was wondering if that needs to be 3, but that probably depends on my earlier question about matching Op1.

Thanks for taking a look. This improves the accuracy of the new double extend costing and extends the comment.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7216	We check Op0->getOpcode() == Op1->getOpcode(), so do this without a matcher.
7221	The comments are all inside the `ifs` to be consistent with the other code here. Otherwise they need to be before the `else`, and I'm not sure that is very readable.
7235	It could be either way, I think, but the third extend would be of a different type. Because they are equivalent (https://alive2.llvm.org/ce/z/1dVe_y), I was just taking the simpler route and costing them as two larger extends and a multiply. I have improved that to use the original types though. It will be good for it to be more accurate.
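The equivalence mentioned in the reply to 7235 can be illustrated with a small exhaustive check over 8-bit inputs. This is a sketch only and is not the Alive2 proof linked above; the function names are hypothetical, and the i8/i32/i64 widths are an assumption chosen to resemble the mla_i8_i64 test. It also shows why the extend opcodes need to match unless both operands are the same value: a zero-extend of a possibly negative product differs from the wide multiply.

```cpp
#include <cstdint>

// ext(mul(ext(a), ext(b))): sign-extend i8 -> i32, multiply, sign-extend to i64.
int64_t sextMulSext(int8_t a, int8_t b) {
  int32_t m = static_cast<int32_t>(a) * static_cast<int32_t>(b);
  return static_cast<int64_t>(m);
}

// Same inner sign-extends, but the outer extend is a zero-extend.
int64_t zextMulSext(int8_t a, int8_t b) {
  int32_t m = static_cast<int32_t>(a) * static_cast<int32_t>(b);
  return static_cast<int64_t>(static_cast<uint32_t>(m));
}

// mul(ext(a), ext(b)) using one wider sign-extend per operand.
int64_t mulWideSext(int8_t a, int8_t b) {
  return static_cast<int64_t>(a) * static_cast<int64_t>(b);
}

// Unsigned counterparts of the first and third functions.
uint64_t zextMulZext(uint8_t a, uint8_t b) {
  uint32_t m = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
  return static_cast<uint64_t>(m);
}
uint64_t mulWideZext(uint8_t a, uint8_t b) {
  return static_cast<uint64_t>(a) * static_cast<uint64_t>(b);
}

// All extends use the same opcode: the double-extended form agrees with the
// wide multiply for every pair of 8-bit inputs.
bool matchingExtendsAgree() {
  for (int a = -128; a <= 127; ++a)
    for (int b = -128; b <= 127; ++b) {
      int8_t sa = static_cast<int8_t>(a), sb = static_cast<int8_t>(b);
      uint8_t ua = static_cast<uint8_t>(a), ub = static_cast<uint8_t>(b);
      if (sextMulSext(sa, sb) != mulWideSext(sa, sb))
        return false;
      if (zextMulZext(ua, ub) != mulWideZext(ua, ub))
        return false;
    }
  return true;
}

// zext(mul(sext(A), sext(A))): the square is never negative, so the outer
// zero-extend is harmless.
bool zextOfSquareAgrees() {
  for (int a = -128; a <= 127; ++a) {
    int8_t sa = static_cast<int8_t>(a);
    if (zextMulSext(sa, sa) != mulWideSext(sa, sa))
      return false;
  }
  return true;
}

// zext(mul(sext(A), sext(B))) with independent A and B: returns false because
// negative products are not preserved by the outer zero-extend.
bool zextOfMixedProductAgrees() {
  for (int a = -128; a <= 127; ++a)
    for (int b = -128; b <= 127; ++b) {
      int8_t sa = static_cast<int8_t>(a), sb = static_cast<int8_t>(b);
      if (zextMulSext(sa, sb) != mulWideSext(sa, sb))
        return false;
    }
  return true;
}
```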

Harbormaster completed remote builds in B116630: Diff 362299. Jul 28 2021, 1:23 AM

Cheers, looks like a good change to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7216	Ah, okay, missed it, but there it is.
7221	Okay, not a big fan of that style, but fair enough.

This revision is now accepted and ready to land. Jul 28 2021, 1:37 AM

This revision was landed with ongoing or failed builds. Jul 28 2021, 4:51 AM

Closed by commit rG41cedb1c9a38: [LV][ARM] Tighten up MLA reduction costing (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG41cedb1c9a38: [LV][ARM] Tighten up MLA reduction costing.

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	19 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	39 lines
llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll	106 lines

Diff 362337

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

[1,617 lines not shown]
 }
 
 InstructionCost
 ARMTTIImpl::getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
                                         Type *ResTy, VectorType *ValTy,
                                         TTI::TargetCostKind CostKind) {
   EVT ValVT = TLI->getValueType(DL, ValTy);
   EVT ResVT = TLI->getValueType(DL, ResTy);
 
   if (ST->hasMVEIntegerOps() && ValVT.isSimple() && ResVT.isSimple()) {
     std::pair<InstructionCost, MVT> LT =
         TLI->getTypeLegalizationCost(DL, ValTy);
-    if ((LT.second == MVT::v16i8 && ResVT.getSizeInBits() <= 32) ||
-        (LT.second == MVT::v8i16 &&
-         ResVT.getSizeInBits() <= (IsMLA ? 64 : 32)) ||
-        (LT.second == MVT::v4i32 && ResVT.getSizeInBits() <= 64))
+
+    // The legal cases are:
+    //   VADDV u/s 8/16/32
+    //   VMLAV u/s 8/16/32
+    //   VADDLV u/s 32
+    //   VMLALV u/s 16/32
+    // Codegen currently cannot always handle larger than legal vectors very
+    // well, especially for predicated reductions where the mask needs to be
+    // split, so restrict to 128bit or smaller input types.
+    unsigned RevVTSize = ResVT.getSizeInBits();
+    if (ValVT.getSizeInBits() <= 128 &&
+        ((LT.second == MVT::v16i8 && RevVTSize <= 32) ||
+         (LT.second == MVT::v8i16 && RevVTSize <= (IsMLA ? 64 : 32)) ||
+         (LT.second == MVT::v4i32 && RevVTSize <= 64)))
       return ST->getMVEVectorCostFactor(CostKind) * LT.first;
   }
 
   return BaseT::getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, ValTy,
                                             CostKind);
 }
 
 InstructionCost
[623 lines not shown]
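The type restriction in the updated getExtendedAddReductionCost can be modelled on its own as a small predicate. This is a minimal sketch under stated assumptions: `LegalVecTy` and `hasNativeExtendedAddReduction` are hypothetical names, the enum stands in for the legalized MVT in `LT.second`, and the checks for MVE integer support and simple value types are left out.

```cpp
// Hypothetical stand-in for the legalized vector type of the reduction input.
enum class LegalVecTy { v16i8, v8i16, v4i32, Other };

// Models the condition in the new code: the input vector must be at most
// 128 bits wide, and the result width must fit what the reduction
// instructions listed in the source comment can produce for the legalized
// element type.
bool hasNativeExtendedAddReduction(unsigned ValBits, LegalVecTy LT,
                                   unsigned ResBits, bool IsMLA) {
  if (ValBits > 128)
    return false; // larger than legal inputs would need to be split
  switch (LT) {
  case LegalVecTy::v16i8:
    return ResBits <= 32;
  case LegalVecTy::v8i16:
    return ResBits <= (IsMLA ? 64u : 32u);
  case LegalVecTy::v4i32:
    return ResBits <= 64;
  case LegalVecTy::Other:
    return false;
  }
  return false;
}
```

Under this model a 256-bit input whose legalized type is v8i16 is rejected by the new size check, where the old condition only looked at the legalized type and the result width.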

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


[7,159 lines not shown; context: Optional<InstructionCost> LoopVectorizationCostModel::getReductionPatternCost(]
   // The basic idea is that we walk down the tree to do that, finding the root
   // reduction instruction in InLoopReductionImmediateChains. From there we find
   // the pattern of mul/ext and test the cost of the entire pattern vs the cost
   // of the components. If the reduction cost is lower then we return it for the
   // reduction instruction and 0 for the other instructions in the pattern. If
   // it is not we return an invalid cost specifying the orignal cost method
   // should be used.
   Instruction *RetI = I;
   if (match(RetI, m_ZExtOrSExt(m_Value()))) {
     if (!RetI->hasOneUser())
       return None;
     RetI = RetI->user_back();
   }
   if (match(RetI, m_Mul(m_Value(), m_Value())) &&
       RetI->user_back()->getOpcode() == Instruction::Add) {
     if (!RetI->hasOneUser())
       return None;
[28 lines not shown]
   // patterns, returning the better cost if it is found.
   Instruction *RedOp = RetI->getOperand(1) == LastChain
                            ? dyn_cast<Instruction>(RetI->getOperand(0))
                            : dyn_cast<Instruction>(RetI->getOperand(1));
 
   VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);
 
   Instruction *Op0, *Op1;
-  if (RedOp && match(RedOp, m_ZExtOrSExt(m_Value())) &&
+  if (RedOp &&
+      match(RedOp,
+            m_ZExtOrSExt(m_Mul(m_Instruction(Op0), m_Instruction(Op1)))) &&
+      match(Op0, m_ZExtOrSExt(m_Value())) &&
+      Op0->getOpcode() == Op1->getOpcode() &&
+      Op0->getOperand(0)->getType() == Op1->getOperand(0)->getType() &&
+      !TheLoop->isLoopInvariant(Op0) && !TheLoop->isLoopInvariant(Op1) &&
+      (Op0->getOpcode() == RedOp->getOpcode() || Op0 == Op1)) {
+
+    // Matched reduce(ext(mul(ext(A), ext(B)))
+    // Note that the extend opcodes need to all match, or if A==B they will have
+    // been converted to zext(mul(sext(A), sext(A))) as it is known positive,
+    // which is equally fine.
+    bool IsUnsigned = isa<ZExtInst>(Op0);
+    auto *ExtType = VectorType::get(Op0->getOperand(0)->getType(), VectorTy);
+    auto *MulType = VectorType::get(Op0->getType(), VectorTy);
+
+    InstructionCost ExtCost =
+        TTI.getCastInstrCost(Op0->getOpcode(), MulType, ExtType,
+                             TTI::CastContextHint::None, CostKind, Op0);
+    InstructionCost MulCost =
+        TTI.getArithmeticInstrCost(Instruction::Mul, MulType, CostKind);
+    InstructionCost Ext2Cost =
+        TTI.getCastInstrCost(RedOp->getOpcode(), VectorTy, MulType,
+                             TTI::CastContextHint::None, CostKind, RedOp);
+
+    InstructionCost RedCost = TTI.getExtendedAddReductionCost(
+        /*IsMLA=*/true, IsUnsigned, RdxDesc.getRecurrenceType(), ExtType,
+        CostKind);
+
+    if (RedCost.isValid() &&
+        RedCost < ExtCost * 2 + MulCost + Ext2Cost + BaseCost)
+      return I == RetI ? RedCost : 0;
+  } else if (RedOp && match(RedOp, m_ZExtOrSExt(m_Value())) &&
              !TheLoop->isLoopInvariant(RedOp)) {
     // Matched reduce(ext(A))
     bool IsUnsigned = isa<ZExtInst>(RedOp);
     auto *ExtType = VectorType::get(RedOp->getOperand(0)->getType(), VectorTy);
     InstructionCost RedCost = TTI.getExtendedAddReductionCost(
         /*IsMLA=*/false, IsUnsigned, RdxDesc.getRecurrenceType(), ExtType,
         CostKind);
 
     InstructionCost ExtCost =
[17 lines not shown; context: if (match(Op0, m_ZExtOrSExt(m_Value())) &&]
           TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy, CostKind);
 
       InstructionCost RedCost = TTI.getExtendedAddReductionCost(
           /*IsMLA=*/true, IsUnsigned, RdxDesc.getRecurrenceType(), ExtType,
           CostKind);
 
       if (RedCost.isValid() && RedCost < ExtCost * 2 + MulCost + BaseCost)
         return I == RetI ? RedCost : 0;
-    } else {
+    } else if (!match(I, m_ZExtOrSExt(m_Value()))) {
       // Matched reduce(mul())
       InstructionCost MulCost =
           TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy, CostKind);
 
       InstructionCost RedCost = TTI.getExtendedAddReductionCost(
           /*IsMLA=*/true, true, RdxDesc.getRecurrenceType(), VectorTy,
           CostKind);
 
[3,201 lines not shown]
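The cost comparison used for the double-extend pattern can be looked at in isolation. The sketch below is an assumption-laden model, not the LoopVectorize code: `DoubleExtMlaCosts` and `preferFusedReduction` are hypothetical names, plain unsigned values stand in for InstructionCost, a boolean models `RedCost.isValid()`, and BaseCost is taken to be whatever the lines not shown compute for the unfused reduction.

```cpp
// Hypothetical container for the costs involved in
// reduce(ext(mul(ext(A), ext(B)))).
struct DoubleExtMlaCosts {
  unsigned ExtCost;   // one inner extend; counted twice, once per operand
  unsigned MulCost;   // the multiply at the intermediate width
  unsigned Ext2Cost;  // the outer extend to the reduction width
  unsigned BaseCost;  // assumed: cost of the reduction without fusing
  bool RedCostValid;  // models RedCost.isValid()
  unsigned RedCost;   // cost of the fused extended MLA reduction
};

// Mirrors the condition
//   RedCost.isValid() && RedCost < ExtCost * 2 + MulCost + Ext2Cost + BaseCost
// so the fused form is only chosen when it is strictly cheaper.
bool preferFusedReduction(const DoubleExtMlaCosts &C) {
  if (!C.RedCostValid)
    return false;
  return C.RedCost < C.ExtCost * 2 + C.MulCost + C.Ext2Cost + C.BaseCost;
}
```

Because the comparison is strict, a fused cost equal to the sum of the parts leaves the original per-instruction costing in place.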

llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll

[716 lines not shown; context: for.body: ; preds = %entry, %for.body]
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 
 for.cond.cleanup: ; preds = %for.body, %entry
   %r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
   ret i64 %r.0.lcssa
 }
 
 ; 8x to use VMLAL.u16
-; FIXME: 8x, TailPredicate, double-extended
+; FIXME: TailPredicate
 define i64 @mla_i8_i64(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {
 ; CHECK-LABEL: @mla_i8_i64(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0
 ; CHECK-NEXT: br i1 [[CMP10]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
 ; CHECK: for.body.preheader:
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 16
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
 ; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
-; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -16
+; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -8
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, <16 x i8>* [[TMP1]], align 1
-; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <8 x i8>*
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i8>, <8 x i8>* [[TMP1]], align 1
+; CHECK-NEXT: [[TMP2:%.*]] = zext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
 ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
-; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <16 x i8>, <16 x i8>* [[TMP4]], align 1
-; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i8> [[WIDE_LOAD1]] to <16 x i32>
-; CHECK-NEXT: [[TMP6:%.*]] = mul nuw nsw <16 x i32> [[TMP5]], [[TMP2]]
-; CHECK-NEXT: [[TMP7:%.*]] = zext <16 x i32> [[TMP6]] to <16 x i64>
-; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> [[TMP7]])
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <8 x i8>*
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i8>, <8 x i8>* [[TMP4]], align 1
+; CHECK-NEXT: [[TMP5:%.*]] = zext <8 x i8> [[WIDE_LOAD1]] to <8 x i32>
+; CHECK-NEXT: [[TMP6:%.*]] = mul nuw nsw <8 x i32> [[TMP5]], [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = zext <8 x i32> [[TMP6]] to <8 x i64>
+; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> [[TMP7]])
 ; CHECK-NEXT: [[TMP9]] = add i64 [[TMP8]], [[VEC_PHI]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 16
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
 ; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]
 ; CHECK: middle.block:
 ; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
 ; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
 ; CHECK: scalar.ph:
 ; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
 ; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP9]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
[374 lines not shown]

 ; 4x or 8x as different types
 define i32 @red_mla_ext_s8_s16_s32(i8* noalias nocapture readonly %A, i16* noalias nocapture readonly %B, i32 %n) #0 {
 ; CHECK-LABEL: @red_mla_ext_s8_s16_s32(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[CMP9_NOT:%.*]] = icmp eq i32 [[N:%.*]], 0
 ; CHECK-NEXT: br i1 [[CMP9_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
-; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
-; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
+; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])
 ; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[A:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <8 x i8>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* [[TMP1]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i8> poison)
-; CHECK-NEXT: [[TMP2:%.*]] = sext <8 x i8> [[WIDE_MASKED_LOAD]] to <8 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <4 x i8>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP1]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
+; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>
 ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[B:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP4]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
-; CHECK-NEXT: [[TMP5:%.*]] = sext <8 x i16> [[WIDE_MASKED_LOAD1]] to <8 x i32>
-; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <8 x i32> [[TMP5]], [[TMP2]]
-; CHECK-NEXT: [[TMP7:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i32> [[TMP6]], <8 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP7]])
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <4 x i16>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* [[TMP4]], i32 2, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i16> poison)
+; CHECK-NEXT: [[TMP5:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD1]] to <4 x i32>
+; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP6]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP7]])
 ; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
+; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
 ; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
 ; CHECK: for.cond.cleanup:
 ; CHECK-NEXT: [[S_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: ret i32 [[S_0_LCSSA]]
 ;
 entry:
   %cmp9.not = icmp eq i32 %n, 0
[21 lines not shown; context: for.cond.cleanup.loopexit: ; preds = %for.body]
   %add.lcssa = phi i32 [ %add, %for.body ]
   br label %for.cond.cleanup
 
 for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
   %s.0.lcssa = phi i32 [ 0, %entry ], [ %add.lcssa, %for.cond.cleanup.loopexit ]
   ret i32 %s.0.lcssa
 }

-; FIXME: 4x as different sext vs zext
+; 4x as different sext vs zext
 define i64 @red_mla_ext_s16_u16_s64(i16* noalias nocapture readonly %A, i16* noalias nocapture readonly %B, i32 %n) #0 {
 ; CHECK-LABEL: @red_mla_ext_s16_u16_s64(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[CMP9_NOT:%.*]] = icmp eq i32 [[N:%.*]], 0
 ; CHECK-NEXT: br i1 [[CMP9_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
 ; CHECK: for.body.preheader:
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
 ; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
-; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -8
+; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[A:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 1
-; CHECK-NEXT: [[TMP2:%.*]] = sext <8 x i16> [[WIDE_LOAD]] to <8 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <4 x i16>*
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 1
+; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i16> [[WIDE_LOAD]] to <4 x i32>
 ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[B:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*
-; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP4]], align 2
-; CHECK-NEXT: [[TMP5:%.*]] = zext <8 x i16> [[WIDE_LOAD1]] to <8 x i32>
-; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <8 x i32> [[TMP5]], [[TMP2]]
-; CHECK-NEXT: [[TMP7:%.*]] = zext <8 x i32> [[TMP6]] to <8 x i64>
-; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> [[TMP7]])
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <4 x i16>*
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i16>, <4 x i16>* [[TMP4]], align 2
+; CHECK-NEXT: [[TMP5:%.*]] = zext <4 x i16> [[WIDE_LOAD1]] to <4 x i32>
+; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = zext <4 x i32> [[TMP6]] to <4 x i64>
+; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[TMP7]])
 ; CHECK-NEXT: [[TMP9]] = add i64 [[TMP8]], [[VEC_PHI]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
 ; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP27:![0-9]+]]
 ; CHECK: middle.block:
 ; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
 ; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
 ; CHECK: scalar.ph:
 ; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
 ; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP9]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
[44 lines not shown; context: for.cond.cleanup.loopexit: ; preds = %for.body]
   %add.lcssa = phi i64 [ %add, %for.body ]
   br label %for.cond.cleanup
 
 for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
   %s.0.lcssa = phi i64 [ 0, %entry ], [ %add.lcssa, %for.cond.cleanup.loopexit ]
   ret i64 %s.0.lcssa
 }

-; FIXME: 4x as different sext vs zext
+; 4x as different sext vs zext
 define i32 @red_mla_u8_s8_u32(i8* noalias nocapture readonly %A, i8* noalias nocapture readonly %B, i32 %n) #0 {
 ; CHECK-LABEL: @red_mla_u8_s8_u32(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[CMP9_NOT:%.*]] = icmp eq i32 [[N:%.*]], 0
 ; CHECK-NEXT: br i1 [[CMP9_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
-; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
-; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
+; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
 ; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
 ; CHECK: vector.body:
 ; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])
 ; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[A:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
-; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <4 x i8>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP1]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
+; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>
 ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[B:%.*]], i32 [[INDEX]]
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP4]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
-; CHECK-NEXT: [[TMP5:%.*]] = sext <16 x i8> [[WIDE_MASKED_LOAD1]] to <16 x i32>
-; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <16 x i32> [[TMP5]], [[TMP2]]
-; CHECK-NEXT: [[TMP7:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP6]], <16 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP7]])
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <4 x i8>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP4]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
+; CHECK-NEXT: [[TMP5:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD1]] to <4 x i32>
+; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP2]]
+; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP6]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP7]])
 ; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
+; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
 ; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], !llvm.loop [[LOOP29:![0-9]+]]
 ; CHECK: for.cond.cleanup:
 ; CHECK-NEXT: [[S_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT: ret i32 [[S_0_LCSSA]]
 ;
 entry:
   %cmp9.not = icmp eq i32 %n, 0
[109 lines not shown]
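The changed constants in these CHECK lines follow from the chosen vectorization factor (VF). As a rough sketch, with hypothetical helper names and assuming VF is a power of two: `and i32 [[N]], -VF` rounds the trip count down to a multiple of VF, leaving the remainder to the scalar loop, while `add i32 [[N]], VF-1` followed by `and ..., -VF` rounds it up for the tail-predicated loops that use the active lane mask.

```cpp
#include <cassert>
#include <cstdint>

// True when VF is a non-zero power of two.
static bool isPowerOfTwo(uint32_t VF) { return VF != 0 && (VF & (VF - 1)) == 0; }

// Models `and i32 N, -VF`: the number of iterations covered by the vector
// loop when the remainder is left to a scalar epilogue.
uint32_t roundDownToVF(uint32_t N, uint32_t VF) {
  assert(isPowerOfTwo(VF) && "VF must be a power of two");
  return N & (0u - VF);
}

// Models `add i32 N, VF-1` followed by `and ..., -VF`: the iteration count of
// a tail-predicated vector loop, where the last iteration may be partial.
uint32_t roundUpToVF(uint32_t N, uint32_t VF) {
  assert(isPowerOfTwo(VF) && "VF must be a power of two");
  return (N + (VF - 1)) & (0u - VF);
}
```

For example, with N = 21 the masks -8 and -16 both round down to 16, whereas the predicated form with VF = 4 rounds up to 24.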