This is an archive of the discontinued LLVM Phabricator instance.


Differential D129994

[RISCV] Add cost modelling for vector widenning integer reduction.
Closed · Public

Authored by jacquesguan on Jul 18 2022, 2:53 AM.


Details

Reviewers

reames
craig.topper
asb
luismarques
frasercrmck
benshi001

Commits

rGb61cfc91eac8: [RISCV] Add cost modelling for vector widenning reduction.

Summary

In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice the input vector's SEW. In this situation, the cost of the extended add reduction should be the same as that of a single-width add reduction.
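
The condition in the summary can be sketched as a small standalone predicate. This is only an illustration and not the code from the patch: `ReductionOp`, `canUseWideningReduction`, and the parameter names are hypothetical, and the ELEN bound and the acceptance of FAdd are taken from the checks visible in the final diff.

```cpp
// Hypothetical stand-in for the reduction opcodes discussed in this review.
enum class ReductionOp { Add, FAdd, Mul, And };

// Returns true when vecreduce.<op>(ext(A)) is a candidate for a single
// widening reduction: the opcode is an add, the result element fits in ELEN,
// and the result element is exactly twice as wide as the source element (SEW).
// All names are illustrative assumptions, not LLVM APIs.
bool canUseWideningReduction(ReductionOp Op, unsigned ResBits,
                             unsigned SrcBits, unsigned ELEN) {
  if (Op != ReductionOp::Add && Op != ReductionOp::FAdd)
    return false;
  if (ResBits > ELEN)
    return false;
  return ResBits == 2 * SrcBits;
}
```

Under these assumptions, an i16 to i32 add reduction qualifies with ELEN = 64, while an i16 to i64 reduction does not, because the result is four times the source width.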

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
60,110 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp

Event Timeline

jacquesguan created this revision. Jul 18 2022, 2:53 AM

Herald added a project: Restricted Project. Jul 18 2022, 2:53 AM

Herald added subscribers: sunshaoce, VincentWu, luke957 and 27 others.

jacquesguan requested review of this revision. Jul 18 2022, 2:53 AM

Herald added a project: Restricted Project. Jul 18 2022, 2:53 AM

Herald added subscribers: llvm-commits, pcwang-thead, eopXD, MaskRay.

Harbormaster completed remote builds in B175990: Diff 445439. Jul 18 2022, 4:30 AM

ping.

Tests?

Also, how are reduction opcodes other than mul and add handled?

This revision now requires changes to proceed. Jul 27 2022, 1:05 PM

In D129994#3683293, @reames wrote:

Tests?

Also, how are reduction opcodes other than mul and add handled?

Sorry, I don't get your point. getExtendedAddReductionCost is only used for getting the cost of vecreduce.add(ext), or of vecreduce.add(mul(ext, ext)) if IsMLA is true, so here we only need to handle ADD.

As for the test, I am not very familiar with CostModel tests. vecreduce.add(ext) takes at least 2 instructions to express, but the test would show the cost of each instruction separately. Any advice on writing such a case?

In D129994#3684013, @jacquesguan wrote:

In D129994#3683293, @reames wrote:

Tests?

Also, how are reduction opcodes other than mul and add handled?

Sorry, I don't get your point. getExtendedAddReductionCost is only used for getting the cost of vecreduce.add(ext), or of vecreduce.add(mul(ext, ext)) if IsMLA is true, so here we only need to handle ADD.

I had missed the Add in the function name and was assuming this was generic to widening reductions. I was thinking this API would be used for the floating point widening reduction variants as well, which appears not to be the case.

Honestly, this seems like a pretty poor API design. Not having thought about this extensively, it seems like splitting this into two APIs would make more sense: one which handles the generic mixed-width reduction case (e.g. vecreduce.opcode(ext(Ty A))), and one which adds the dot-product acceleration case (e.g. vecreduce.add(mul(ext(Ty A), ext(Ty B)))). Having a getExtendedReductionCost API would allow you to handle the FADD case, and potentially any future widening reduction instructions. (I don't have any particular extension in mind here, just a general concern.)

Huh, actually it's worse than that. The loop vectorizer appears to also be using this function for both reduce(ext(mul(ext(A), ext(B)))) and reduce(mul(ext(A), ext(B))) (note the difference in the outer extend), and is inconsistent about which values for MLA are passed in. The current usage appears to contradict the documented interface for the existing routine.

You should fix that before we continue with this patch.
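
The difference made by the outer extend can be seen in a scalar model of the two shapes. This is a sketch for illustration only: the function names are hypothetical, unsigned (zext) element types are assumed so that wraparound is well defined, and the inner extend is assumed to go from 8 to 16 bits.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// reduce.add(zext(mul(zext(A), zext(B)))): multiply in 16 bits, then widen
// each product to 32 bits before accumulating.
uint32_t reduceExtMulExt(const std::vector<uint8_t> &A,
                         const std::vector<uint8_t> &B) {
  uint32_t Sum = 0;
  const std::size_t N = std::min(A.size(), B.size());
  for (std::size_t I = 0; I < N; ++I) {
    const uint16_t Prod = static_cast<uint16_t>(A[I] * B[I]);
    Sum += static_cast<uint32_t>(Prod); // outer extend
  }
  return Sum;
}

// reduce.add(mul(zext(A), zext(B))): no outer extend, so the accumulation
// happens in 16 bits and wraps modulo 2^16.
uint16_t reduceMulExt(const std::vector<uint8_t> &A,
                      const std::vector<uint8_t> &B) {
  uint16_t Sum = 0;
  const std::size_t N = std::min(A.size(), B.size());
  for (std::size_t I = 0; I < N; ++I) {
    const uint16_t Prod = static_cast<uint16_t>(A[I] * B[I]);
    Sum = static_cast<uint16_t>(Sum + Prod);
  }
  return Sum;
}
```

With two elements of value 255 in each input, the first form yields 130050 while the second wraps to 64514. The two shapes are different computations, which is why a cost query has to be clear about which one it is costing.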

As for the test, I am not very familiar with CostModel tests. vecreduce.add(ext) takes at least 2 instructions to express, but the test would show the cost of each instruction separately. Any advice on writing such a case?

If nothing else, you could write a vectorizer test which checked the cost-model output. Not ideal, but possible.

I created a new revision, https://reviews.llvm.org/D130868, to try to refactor this API. Would you like to review it?

Refactor the code.

Harbormaster completed remote builds in B178737: Diff 449248. Aug 2 2022, 5:30 AM

Thanks for the rework of the API, much improved!

I'd still *prefer* a vectorizer test here, but won't strictly require one.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

397

It's not really clear to me why you bother to translate to ISD here. The code would be much clearer as:

if (Opcode != Add && Opcode != FAdd)
  return ...
if (width check fails)
  return ...

std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
return (LT.first - 1) +
             getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
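
The arithmetic behind this suggestion can be modelled in isolation. The sketch below is an assumption-laden model, not LLVM code: `LegalParts` stands in for `LT.first`, `MaxVL` for the value returned by `getMaxVLFor`, and `arithmeticReductionCost` reproduces only the tail of `getArithmeticReductionCost` that is visible in this revision's diff (base cost 2, plus VL for ordered reductions or ceil(log2(VL)) otherwise).

```cpp
#include <cstdint>

// ceil(log2(V)) for V >= 1.
unsigned log2Ceil(uint32_t V) {
  unsigned Bits = 0;
  while ((uint64_t{1} << Bits) < V)
    ++Bits;
  return Bits;
}

// Models the visible tail of getArithmeticReductionCost.
// LegalParts (>= 1) plays the role of LT.first.
unsigned arithmeticReductionCost(unsigned LegalParts, uint32_t MaxVL,
                                 bool Ordered) {
  const unsigned BaseCost = 2; // same constant as in the diff
  if (Ordered)
    return (LegalParts - 1) + BaseCost + MaxVL;
  return (LegalParts - 1) + BaseCost + log2Ceil(MaxVL);
}

// Models the suggested return expression for the extended reduction:
// (LT.first - 1) + getArithmeticReductionCost(...).
unsigned extendedReductionCost(unsigned LegalParts, uint32_t MaxVL,
                               bool Ordered) {
  return (LegalParts - 1) + arithmeticReductionCost(LegalParts, MaxVL, Ordered);
}
```

For a type that legalizes to a single register with a maximum VL of 8, this model gives 2 + 3 = 5 for both the single-width and the extended unordered reduction, which matches the intent stated in the summary.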

This revision now requires changes to proceed. Aug 2 2022, 10:26 AM

Address comment and add test.

In D129994#3694016, @reames wrote:

Thanks for the rework of the API, much improved!

I'd still *prefer* a vectorizer test here, but won't strictly require one.

I added a test which shows the difference between in-loop and out-of-loop reduction. On RISC-V, we can use the widening reduction instructions if we do the in-loop reduction. However, the preferInLoopReduction API may need some adjustment to enable in-loop reduction for the extended reduction case; I will change that in a new revision.
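
The two strategies compared by the test can be mimicked with a scalar simulation. This is a minimal sketch, assuming a fixed vectorization factor `VF` of at least 1 and hypothetical function names. It mirrors the structure of the vectorized loops (vector accumulator versus per-iteration reduction), not the actual generated code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Out-of-loop reduction: keep a VF-lane accumulator inside the loop and do a
// single horizontal reduction afterwards (the "middle block").
int32_t sumOutOfLoop(const std::vector<int16_t> &X, std::size_t VF) {
  std::vector<int32_t> Acc(VF, 0);
  std::size_t I = 0;
  for (; I + VF <= X.size(); I += VF)
    for (std::size_t L = 0; L < VF; ++L)
      Acc[L] += static_cast<int32_t>(X[I + L]); // sext, then lane-wise add
  int32_t Sum = 0;
  for (int32_t V : Acc)
    Sum += V; // one reduction after the loop
  for (; I < X.size(); ++I)
    Sum += X[I]; // scalar remainder loop
  return Sum;
}

// In-loop reduction: reduce each chunk to a scalar inside the loop. This is
// the shape vecreduce.add(sext(chunk)) that a widening reduction could cover.
int32_t sumInLoop(const std::vector<int16_t> &X, std::size_t VF) {
  int32_t Sum = 0;
  std::size_t I = 0;
  for (; I + VF <= X.size(); I += VF) {
    int32_t Chunk = 0;
    for (std::size_t L = 0; L < VF; ++L)
      Chunk += static_cast<int32_t>(X[I + L]);
    Sum += Chunk; // scalar accumulate per iteration
  }
  for (; I < X.size(); ++I)
    Sum += X[I]; // scalar remainder loop
  return Sum;
}
```

Both functions return the same value for any input. What differs is where the horizontal reduction happens, and therefore which cost-model hook is relevant.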

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
397	Done.

Harbormaster completed remote builds in B179007: Diff 449628. Aug 3 2022, 5:45 AM

LGTM

If I read your last comment correctly, the new test doesn't actually hit the relevant cost model change without follow-up work, right? If so, please land it separately (having coverage for in-loop reduction seems useful) and then land the code change on its own. Landing them together would give the false impression that the code is tested, which would be potentially confusing.

Side note, I don't think we're going to ever prefer inloop reductions over out of loop ones. Reduction instructions are generally log(VL) complexity or worse. Unless maybe there's a case I'm missing?
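
A toy model makes the side note concrete. Everything here is an assumption made for illustration: one elementwise vector add is taken to cost 1, a reduction is taken to cost 2 + ceil(log2(VL)) as in the diff's unordered case, and the function names are hypothetical. Real hardware costs may differ substantially.

```cpp
#include <cstdint>

// ceil(log2(V)) for V >= 1.
unsigned log2Ceil(uint32_t V) {
  unsigned Bits = 0;
  while ((uint64_t{1} << Bits) < V)
    ++Bits;
  return Bits;
}

// Assumed cost of one horizontal reduction over VL elements.
unsigned reductionCost(uint32_t VL) { return 2 + log2Ceil(VL); }

// Out-of-loop: one vector add per iteration, one reduction at the end.
unsigned outOfLoopCost(unsigned Iterations, uint32_t VL) {
  const unsigned VectorAddCost = 1; // assumed unit cost
  return Iterations * VectorAddCost + reductionCost(VL);
}

// In-loop: one reduction in every iteration.
unsigned inLoopCost(unsigned Iterations, uint32_t VL) {
  return Iterations * reductionCost(VL);
}
```

In this model, 100 iterations at VL = 8 cost 105 out of loop versus 500 in loop; only a single-iteration loop comes out marginally cheaper in-loop (5 versus 6).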

This revision is now accepted and ready to land. Aug 3 2022, 7:26 AM

This revision was landed with ongoing or failed builds. Aug 4 2022, 12:32 AM

Closed by commit rGb61cfc91eac8: [RISCV] Add cost modelling for vector widenning reduction. (authored by jacquesguan).

This revision was automatically updated to reflect the committed changes.

jacquesguan added a commit: rGb61cfc91eac8: [RISCV] Add cost modelling for vector widenning reduction.

Revision Contents

Path	Size
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h	5 lines
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp	26 lines
llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll	169 lines

Diff 449628

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

@@ public:
   InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
                                          bool IsUnsigned,
                                          TTI::TargetCostKind CostKind);

   InstructionCost getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
                                              Optional<FastMathFlags> FMF,
                                              TTI::TargetCostKind CostKind);

+  InstructionCost getExtendedReductionCost(unsigned Opcode, bool IsUnsigned,
+                                           Type *ResTy, VectorType *ValTy,
+                                           Optional<FastMathFlags> FMF,
+                                           TTI::TargetCostKind CostKind);
+
   bool isElementTypeLegalForScalableVector(Type *Ty) const {
     return TLI->isLegalElementTypeForRVV(Ty);
   }

   bool isLegalMaskedLoadStore(Type *DataType, Align Alignment) {
     if (!ST->hasVInstructions())
       return false;

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

@@ RISCVTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
   // IR Reduction is composed by two vmv and one rvv reduction instruction.
   InstructionCost BaseCost = 2;
   unsigned VL = getMaxVLFor(Ty);
   if (TTI::requiresOrderedReduction(FMF))
     return (LT.first - 1) + BaseCost + VL;
   return (LT.first - 1) + BaseCost + Log2_32_Ceil(VL);
 }

+InstructionCost RISCVTTIImpl::getExtendedReductionCost(
+    unsigned Opcode, bool IsUnsigned, Type *ResTy, VectorType *ValTy,
+    Optional<FastMathFlags> FMF, TTI::TargetCostKind CostKind) {
+  if (isa<FixedVectorType>(ValTy) && !ST->useRVVForFixedLengthVectors())
+    return BaseT::getExtendedReductionCost(Opcode, IsUnsigned, ResTy, ValTy,
+                                           FMF, CostKind);
+
+  // Skip if scalar size of ResTy is bigger than ELEN.
+  if (ResTy->getScalarSizeInBits() > ST->getELEN())
+    return BaseT::getExtendedReductionCost(Opcode, IsUnsigned, ResTy, ValTy,
+                                           FMF, CostKind);
+
+  if (Opcode != Instruction::Add && Opcode != Instruction::FAdd)
+    return BaseT::getExtendedReductionCost(Opcode, IsUnsigned, ResTy, ValTy,
+                                           FMF, CostKind);
+
+  std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
+

reames (inline comment): It's not really clear to me why you bother to translate to ISD here. The code would be much clearer as:

    if (Opcode != Add && Opcode != FAdd)
      return ...
    if (width check fails)
      return ...

    std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
    return (LT.first - 1) +
           getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);

jacquesguan (inline comment): Done.

+  if (ResTy->getScalarSizeInBits() != 2 * LT.second.getScalarSizeInBits())
+    return BaseT::getExtendedReductionCost(Opcode, IsUnsigned, ResTy, ValTy,
+                                           FMF, CostKind);
+
+  return (LT.first - 1) +
+         getArithmeticReductionCost(Opcode, ValTy, FMF, CostKind);
+}
+
 void RISCVTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                            TTI::UnrollingPreferences &UP,
                                            OptimizationRemarkEmitter *ORE) {
   // TODO: More tuning on benchmarks and metrics with changes as needed
   //       would apply to all settings below to enable performance.


   if (ST->enableDefaultUnroll())

llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -loop-vectorize < %s -S -o - | FileCheck %s -check-prefix=OUTLOOP
; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -loop-vectorize -prefer-inloop-reductions < %s -S -o - | FileCheck %s -check-prefix=INLOOP


target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n64-S128"
target triple = "riscv64"

define i32 @add_i16_i32(i16* nocapture readonly %x, i32 %n) {
; OUTLOOP-LABEL: @add_i16_i32(
; OUTLOOP-NEXT: entry:
; OUTLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; OUTLOOP-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; OUTLOOP: for.body.preheader:
; OUTLOOP-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
; OUTLOOP-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 4
; OUTLOOP-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], [[TMP1]]
; OUTLOOP-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; OUTLOOP: vector.ph:
; OUTLOOP-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()
; OUTLOOP-NEXT: [[TMP3:%.*]] = mul i32 [[TMP2]], 4
; OUTLOOP-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N]], [[TMP3]]
; OUTLOOP-NEXT: [[N_VEC:%.*]] = sub i32 [[N]], [[N_MOD_VF]]
; OUTLOOP-NEXT: br label [[VECTOR_BODY:%.*]]
; OUTLOOP: vector.body:
; OUTLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; OUTLOOP-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
; OUTLOOP-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 2 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP21:%.*]], [[VECTOR_BODY]] ]
; OUTLOOP-NEXT: [[TMP4:%.*]] = add i32 [[INDEX]], 0
; OUTLOOP-NEXT: [[TMP5:%.*]] = call i32 @llvm.vscale.i32()
; OUTLOOP-NEXT: [[TMP6:%.*]] = mul i32 [[TMP5]], 2
; OUTLOOP-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], 0
; OUTLOOP-NEXT: [[TMP8:%.*]] = mul i32 [[TMP7]], 1
; OUTLOOP-NEXT: [[TMP9:%.*]] = add i32 [[INDEX]], [[TMP8]]
; OUTLOOP-NEXT: [[TMP10:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[TMP4]]
; OUTLOOP-NEXT: [[TMP11:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[TMP9]]
; OUTLOOP-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[TMP10]], i32 0
; OUTLOOP-NEXT: [[TMP13:%.*]] = bitcast i16* [[TMP12]] to <vscale x 2 x i16>*
; OUTLOOP-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i16>, <vscale x 2 x i16>* [[TMP13]], align 2
; OUTLOOP-NEXT: [[TMP14:%.*]] = call i32 @llvm.vscale.i32()
; OUTLOOP-NEXT: [[TMP15:%.*]] = mul i32 [[TMP14]], 2
; OUTLOOP-NEXT: [[TMP16:%.*]] = getelementptr inbounds i16, i16* [[TMP10]], i32 [[TMP15]]
; OUTLOOP-NEXT: [[TMP17:%.*]] = bitcast i16* [[TMP16]] to <vscale x 2 x i16>*
; OUTLOOP-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 2 x i16>, <vscale x 2 x i16>* [[TMP17]], align 2
; OUTLOOP-NEXT: [[TMP18:%.*]] = sext <vscale x 2 x i16> [[WIDE_LOAD]] to <vscale x 2 x i32>
; OUTLOOP-NEXT: [[TMP19:%.*]] = sext <vscale x 2 x i16> [[WIDE_LOAD2]] to <vscale x 2 x i32>
; OUTLOOP-NEXT: [[TMP20]] = add <vscale x 2 x i32> [[VEC_PHI]], [[TMP18]]
; OUTLOOP-NEXT: [[TMP21]] = add <vscale x 2 x i32> [[VEC_PHI1]], [[TMP19]]
; OUTLOOP-NEXT: [[TMP22:%.*]] = call i32 @llvm.vscale.i32()
; OUTLOOP-NEXT: [[TMP23:%.*]] = mul i32 [[TMP22]], 4
; OUTLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP23]]
; OUTLOOP-NEXT: [[TMP24:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; OUTLOOP-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; OUTLOOP: middle.block:
; OUTLOOP-NEXT: [[BIN_RDX:%.*]] = add <vscale x 2 x i32> [[TMP21]], [[TMP20]]
; OUTLOOP-NEXT: [[TMP25:%.*]] = call i32 @llvm.vector.reduce.add.nxv2i32(<vscale x 2 x i32> [[BIN_RDX]])
; OUTLOOP-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N]], [[N_VEC]]
; OUTLOOP-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
; OUTLOOP: scalar.ph:
; OUTLOOP-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
; OUTLOOP-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[TMP25]], [[MIDDLE_BLOCK]] ]
; OUTLOOP-NEXT: br label [[FOR_BODY:%.*]]
; OUTLOOP: for.body:
; OUTLOOP-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; OUTLOOP-NEXT: [[R_07:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; OUTLOOP-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[I_08]]
; OUTLOOP-NEXT: [[TMP26:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
; OUTLOOP-NEXT: [[CONV:%.*]] = sext i16 [[TMP26]] to i32
; OUTLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]
; OUTLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; OUTLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; OUTLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
; OUTLOOP: for.cond.cleanup.loopexit:
; OUTLOOP-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[TMP25]], [[MIDDLE_BLOCK]] ]
; OUTLOOP-NEXT: br label [[FOR_COND_CLEANUP]]
; OUTLOOP: for.cond.cleanup:
; OUTLOOP-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ]
; OUTLOOP-NEXT: ret i32 [[R_0_LCSSA]]
;
; INLOOP-LABEL: @add_i16_i32(
; INLOOP-NEXT: entry:
; INLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; INLOOP-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; INLOOP: for.body.preheader:
; INLOOP-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
; INLOOP-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 8
; INLOOP-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], [[TMP1]]
; INLOOP-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; INLOOP: vector.ph:
; INLOOP-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()
; INLOOP-NEXT: [[TMP3:%.*]] = mul i32 [[TMP2]], 8
; INLOOP-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N]], [[TMP3]]
; INLOOP-NEXT: [[N_VEC:%.*]] = sub i32 [[N]], [[N_MOD_VF]]
; INLOOP-NEXT: br label [[VECTOR_BODY:%.*]]
; INLOOP: vector.body:
; INLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; INLOOP-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP21:%.*]], [[VECTOR_BODY]] ]
; INLOOP-NEXT: [[VEC_PHI1:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP23:%.*]], [[VECTOR_BODY]] ]
; INLOOP-NEXT: [[TMP4:%.*]] = add i32 [[INDEX]], 0
; INLOOP-NEXT: [[TMP5:%.*]] = call i32 @llvm.vscale.i32()
; INLOOP-NEXT: [[TMP6:%.*]] = mul i32 [[TMP5]], 4
; INLOOP-NEXT: [[TMP7:%.*]] = add i32 [[TMP6]], 0
; INLOOP-NEXT: [[TMP8:%.*]] = mul i32 [[TMP7]], 1
; INLOOP-NEXT: [[TMP9:%.*]] = add i32 [[INDEX]], [[TMP8]]
; INLOOP-NEXT: [[TMP10:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[TMP4]]
; INLOOP-NEXT: [[TMP11:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[TMP9]]
; INLOOP-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[TMP10]], i32 0
; INLOOP-NEXT: [[TMP13:%.*]] = bitcast i16* [[TMP12]] to <vscale x 4 x i16>*
; INLOOP-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, <vscale x 4 x i16>* [[TMP13]], align 2
; INLOOP-NEXT: [[TMP14:%.*]] = call i32 @llvm.vscale.i32()
; INLOOP-NEXT: [[TMP15:%.*]] = mul i32 [[TMP14]], 4
; INLOOP-NEXT: [[TMP16:%.*]] = getelementptr inbounds i16, i16* [[TMP10]], i32 [[TMP15]]
; INLOOP-NEXT: [[TMP17:%.*]] = bitcast i16* [[TMP16]] to <vscale x 4 x i16>*
; INLOOP-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x i16>, <vscale x 4 x i16>* [[TMP17]], align 2
; INLOOP-NEXT: [[TMP18:%.*]] = sext <vscale x 4 x i16> [[WIDE_LOAD]] to <vscale x 4 x i32>
; INLOOP-NEXT: [[TMP19:%.*]] = sext <vscale x 4 x i16> [[WIDE_LOAD2]] to <vscale x 4 x i32>
; INLOOP-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP18]])
; INLOOP-NEXT: [[TMP21]] = add i32 [[TMP20]], [[VEC_PHI]]
; INLOOP-NEXT: [[TMP22:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP19]])
; INLOOP-NEXT: [[TMP23]] = add i32 [[TMP22]], [[VEC_PHI1]]
; INLOOP-NEXT: [[TMP24:%.*]] = call i32 @llvm.vscale.i32()
; INLOOP-NEXT: [[TMP25:%.*]] = mul i32 [[TMP24]], 8
; INLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP25]]
; INLOOP-NEXT: [[TMP26:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; INLOOP-NEXT: br i1 [[TMP26]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; INLOOP: middle.block:
; INLOOP-NEXT: [[BIN_RDX:%.*]] = add i32 [[TMP23]], [[TMP21]]
; INLOOP-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N]], [[N_VEC]]
; INLOOP-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
; INLOOP: scalar.ph:
; INLOOP-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
; INLOOP-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[BIN_RDX]], [[MIDDLE_BLOCK]] ]
; INLOOP-NEXT: br label [[FOR_BODY:%.*]]
; INLOOP: for.body:
; INLOOP-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; INLOOP-NEXT: [[R_07:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; INLOOP-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[I_08]]
; INLOOP-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
; INLOOP-NEXT: [[CONV:%.*]] = sext i16 [[TMP27]] to i32
; INLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]
; INLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; INLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; INLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
; INLOOP: for.cond.cleanup.loopexit:
; INLOOP-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[BIN_RDX]], [[MIDDLE_BLOCK]] ]
; INLOOP-NEXT: br label [[FOR_COND_CLEANUP]]
; INLOOP: for.cond.cleanup:
; INLOOP-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ]
; INLOOP-NEXT: ret i32 [[R_0_LCSSA]]
;
entry:
  %cmp6 = icmp sgt i32 %n, 0
  br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body
  %i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
  %r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]
  %arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.08
  %0 = load i16, i16* %arrayidx, align 2
  %conv = sext i16 %0 to i32
  %add = add nsw i32 %r.07, %conv
  %inc = add nuw nsw i32 %i.08, 1
  %exitcond = icmp eq i32 %inc, %n
  br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry
  %r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
  ret i32 %r.0.lcssa
}
