This is an archive of the discontinued LLVM Phabricator instance.

[LV][AArch64] Enable scalable vectorization of loops that contain FREM instructions
Needs Review · Public

Authored by jolanta.jensen on Aug 9 2023, 10:28 AM.

Details

Reviewers

mgabka
paulwalker-arm
huntergr

Summary

This patch enables scalable vectorization of loops that contain FREM
instructions when a suitable vector library is available.

Depends on D156439
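
For orientation, the kind of source loop this patch targets can be sketched as below. This is an illustrative assumption, not code from the patch: the function name `fmod_loop` and its parameters are hypothetical. The LLVM Language Reference describes `frem` as producing the same result as the libm `fmod` function (without setting errno), so the result takes the sign of the dividend.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical example loop: every element of `a` receives the floating-point
// remainder of the matching element of `b` divided by a loop-invariant divisor.
// std::fmod has the same numeric result as the IR `frem` instruction.
void fmod_loop(double *a, const double *b, std::size_t n, double divisor) {
  for (std::size_t i = 0; i < n; ++i)
    a[i] = std::fmod(b[i], divisor);
}
```

A loop of this shape has no cross-iteration dependency, so whether it is vectorized depends mainly on whether the cost model can price a vector `frem`, which is what the patch addresses.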

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jolanta.jensen created this revision. Aug 9 2023, 10:28 AM

Herald added a project: Restricted Project. Aug 9 2023, 10:28 AM

Herald added subscribers: artagnon, hiraditya, kristof.beyls.

jolanta.jensen requested review of this revision. Aug 9 2023, 10:28 AM

Herald added a project: Restricted Project. Aug 9 2023, 10:28 AM

Herald added subscribers: llvm-commits, wangpc, alextsao1999.

paulwalker-arm added a reviewer: huntergr. Aug 9 2023, 10:29 AM

Harbormaster completed remote builds in B251424: Diff 548672. Aug 9 2023, 12:48 PM

huntergr added inline comments. Aug 10 2023, 2:37 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7224–7238	I don't think we need to copy this part from the above -- it's very unlikely that we'll have a cheaper version of frem with an invariant second operand.
7247–7252	I think we want to check TargetTransformInfo for a cost instead of just returning invalid if there's no vector function available. While AArch64 doesn't have a hardware frem implementation, there may be another architecture which does and we don't want to prevent that from being used. Gathering costs for a vector/scalarized frem instruction vs. a library call and returning the cheapest cost would be a good plan.
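
The "gather both costs and return the cheapest" plan suggested above can be sketched in isolation as follows. This is a minimal standalone model, not LLVM code: `Cost` and `cheapestCost` are hypothetical names, and `std::nullopt` stands in for an invalid cost of the kind `InstructionCost::getInvalid()` represents in the patch.

```cpp
#include <algorithm>
#include <optional>

// Hypothetical stand-in for a cost value; std::nullopt models "invalid",
// meaning the corresponding lowering is not available.
using Cost = std::optional<unsigned>;

// Return the cheaper of the two candidates. If only one is valid, return it.
// If neither is valid, the result is invalid as well.
Cost cheapestCost(Cost instrCost, Cost libCallCost) {
  if (!instrCost)
    return libCallCost;
  if (!libCallCost)
    return instrCost;
  return std::min(*instrCost, *libCallCost);
}
```

With this shape, a target that can lower a vector `frem` directly is never blocked by a missing library function, and a target without one can still vectorize through the library call.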

Addressing review comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7224–7238	Removed.
7247–7252	Isn't that check made on line 7233? The one calling TTI.getArithmeticInstrCost(). I gather them and return what is lower.

Harbormaster completed remote builds in B251755: Diff 549129. Aug 10 2023, 6:40 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	59 lines
llvm/test/Transforms/LoopVectorize/AArch64/frem.ll	112 lines

Diff 548672

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[7,182 lines not shown]
case Instruction::SRem:		case Instruction::SRem:
[[fallthrough]];		[[fallthrough]];
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// If we're speculating on the stride being 1, the multiplication may		// If we're speculating on the stride being 1, the multiplication may
// fold away. We can generalize this for all operations using the notion		// fold away. We can generalize this for all operations using the notion
[16 lines not shown]
if (Op2Info.Kind == TargetTransformInfo::OK_AnyValue &&		if (Op2Info.Kind == TargetTransformInfo::OK_AnyValue &&
Op2Info.Kind = TargetTransformInfo::OK_UniformValue;		Op2Info.Kind = TargetTransformInfo::OK_UniformValue;

SmallVector<const Value *, 4> Operands(I->operand_values());		SmallVector<const Value *, 4> Operands(I->operand_values());
return TTI.getArithmeticInstrCost(		return TTI.getArithmeticInstrCost(
I->getOpcode(), VectorTy, CostKind,		I->getOpcode(), VectorTy, CostKind,
{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},		{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},
Op2Info, Operands, I);		Op2Info, Operands, I);
}		}
		case Instruction::FRem: {
		// Certain instructions can be cheaper to vectorize if they have a constant
		// second vector operand. One example of this are shifts on x86.
		Value *Op2 = I->getOperand(1);
		auto Op2Info = TTI.getOperandInfo(Op2);
		if (Op2Info.Kind == TargetTransformInfo::OK_AnyValue &&
		Legal->isInvariant(Op2))
		Op2Info.Kind = TargetTransformInfo::OK_UniformValue;

		SmallVector<const Value *, 4> Operands(I->operand_values());
		InstructionCost Cost = TTI.getArithmeticInstrCost(
		I->getOpcode(), VectorTy, CostKind,
		{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},
		Op2Info, Operands, I);
		if (Cost != InstructionCost::getInvalid())
		return Cost;
		// We need to check whether a library function is available, as we don't
		// want to emit frem instructions operating on scalable vectors for
		// targets on which such instructions cannot be code generated.
		if (VF.isScalable()) {
		if (TLI) {
		Module *M = I->getModule();
		StringRef ScalarFnName;
		Type *Ty = I->getType();
		if (Ty->isFloatTy())
		ScalarFnName = TLI->getName(LibFunc_fmodf);
		else if (Ty->isDoubleTy())
		ScalarFnName = TLI->getName(LibFunc_fmod);
		else
		return InstructionCost::getInvalid();
		Type *RetTy = ToVectorTy(Ty, VF);
		SmallVector<Type *> Tys = {RetTy, RetTy};
		Function *TLIFunc = nullptr;
		StringRef TLIName = TLI->getVectorizedFunction(ScalarFnName, VF);
		if (TLIName.empty()) {
		TLIName = TLI->getVectorizedFunction(ScalarFnName, VF, true);
		if (TLIName.empty())
		return InstructionCost::getInvalid();
		// Get the mask position.
		std::optional<llvm::VFInfo> Info =
		VFABI::tryDemangleForVFABI(TLIName, *M, VF);
		if (!Info)
		return InstructionCost::getInvalid();
		unsigned MaskPos = Info->getParamIndexForOptionalMask().value();
		Tys.insert(Tys.begin() + MaskPos,
		VectorType::get(Type::getInt1Ty(M->getContext()), VF));
		}
		TLIFunc = Function::Create(FunctionType::get(RetTy, Tys, false),
		Function::ExternalLinkage, ScalarFnName, *M);
		if (TLIFunc == nullptr)
		return InstructionCost::getInvalid();
		return TTI.getCallInstrCost(TLIFunc, RetTy, Tys,
		TTI::TCK_RecipThroughput);
		}
		return InstructionCost::getInvalid();
		}
		return InstructionCost::getInvalid();
		}
case Instruction::FNeg: {		case Instruction::FNeg: {
return TTI.getArithmeticInstrCost(		return TTI.getArithmeticInstrCost(
I->getOpcode(), VectorTy, CostKind,		I->getOpcode(), VectorTy, CostKind,
{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},		{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},
{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},		{TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None},
I->getOperand(0), I);		I->getOperand(0), I);
}		}
case Instruction::Select: {		case Instruction::Select: {

llvm/test/Transforms/LoopVectorize/AArch64/frem.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
				; RUN: opt -mtriple aarch64-linux-generic -mattr=+sve -vector-library=sleefgnuabi -passes=loop-vectorize,instcombine -S < %s | FileCheck %s

				define void @fmod_vec(ptr noalias nocapture %a,
				; CHECK-LABEL: define void @fmod_vec
				; CHECK-SAME: (ptr noalias nocapture [[A:%.*]], ptr noalias nocapture readonly [[B:%.*]]) #[[ATTR0:[0-9]+]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds double, ptr [[B]], i64 [[INDEX]]
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x double>, ptr [[TMP0]], align 8
				; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP2:%.*]] = shl nuw nsw i64 [[TMP1]], 1
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds double, ptr [[TMP0]], i64 [[TMP2]]
				; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 2 x double>, ptr [[TMP3]], align 8
				; CHECK-NEXT: [[TMP4:%.*]] = frem fast <vscale x 2 x double> [[WIDE_LOAD]], shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double 0x40091EB860000000, i64 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP5:%.*]] = frem fast <vscale x 2 x double> [[WIDE_LOAD1]], shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double 0x40091EB860000000, i64 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds double, ptr [[A]], i64 [[INDEX]]
				; CHECK-NEXT: store <vscale x 2 x double> [[TMP4]], ptr [[TMP6]], align 8
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 1
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds double, ptr [[TMP6]], i64 [[TMP8]]
				; CHECK-NEXT: store <vscale x 2 x double> [[TMP5]], ptr [[TMP9]], align 8
				; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i64 [[TMP10]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
				; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: br i1 poison, label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP3:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				ptr noalias nocapture readonly %b) #0 {
				entry:
				br label %for.body
				for.body: ; preds = %entry, %for.body
				%i = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds double, ptr %b, i64 %i
				%0 = load double, ptr %arrayidx, align 8
				%1 = frem fast double %0, 0x40091EB860000000
				%arrayidx2 = getelementptr inbounds double, ptr %a, i64 %i
				store double %1, ptr %arrayidx2, align 8
				%inc = add nuw nsw i64 %i, 1
				%cmp = icmp ult i64 %inc, 256
				br i1 %cmp, label %for.body, label %for.end
				for.end: ; preds = %for.body
				ret void
				}

				define void @fmodf_vec(ptr noalias nocapture %a,
				; CHECK-LABEL: define void @fmodf_vec
				; CHECK-SAME: (ptr noalias nocapture [[A:%.*]], ptr noalias nocapture readonly [[B:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[INDEX]]
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP0]], align 4
				; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP2:%.*]] = shl nuw nsw i64 [[TMP1]], 2
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, ptr [[TMP0]], i64 [[TMP2]]
				; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 4 x float>, ptr [[TMP3]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = frem fast <vscale x 4 x float> [[WIDE_LOAD]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 0x40091EB860000000, i64 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP5:%.*]] = frem fast <vscale x 4 x float> [[WIDE_LOAD1]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 0x40091EB860000000, i64 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[INDEX]]
				; CHECK-NEXT: store <vscale x 4 x float> [[TMP4]], ptr [[TMP6]], align 4
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 2
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, ptr [[TMP6]], i64 [[TMP8]]
				; CHECK-NEXT: store <vscale x 4 x float> [[TMP5]], ptr [[TMP9]], align 4
				; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i64 [[TMP10]], 3
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
				; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: br i1 poison, label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP5:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				ptr noalias nocapture readonly %b) #0 {
				entry:
				br label %for.body
				for.body: ; preds = %entry, %for.body
				%i = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds float, ptr %b, i64 %i
				%0 = load float, ptr %arrayidx, align 4
				%1 = frem fast float %0, 0x40091EB860000000
				%arrayidx2 = getelementptr inbounds float, ptr %a, i64 %i
				store float %1, ptr %arrayidx2, align 4
				%inc = add nuw nsw i64 %i, 1
				%cmp = icmp ult i64 %inc, 256
				br i1 %cmp, label %for.body, label %for.end
				for.end: ; preds = %for.body
				ret void
				}

				attributes #0 = { vscale_range(1,16) }
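
The index arithmetic in the CHECK lines above can be reproduced with a small worked sketch. In both functions each vector iteration performs two wide loads and two wide stores, which is consistent with an interleave count of 2; the step is `vscale << 2` for `<vscale x 2 x double>` and `vscale << 3` for `<vscale x 4 x float>`. The helper name `indexStep` is hypothetical and only illustrates the arithmetic.

```cpp
#include <cstdint>

// Elements processed per vector-loop iteration:
//   vscale * (minimum lanes per vector) * (interleave count).
constexpr std::uint64_t indexStep(std::uint64_t vscale, std::uint64_t minLanes,
                                  std::uint64_t interleave) {
  return vscale * minLanes * interleave;
}

// <vscale x 2 x double>, interleave 2: step is vscale * 4 (shl by 2).
static_assert(indexStep(1, 2, 2) == 4, "double case with vscale = 1");
// <vscale x 4 x float>, interleave 2: step is vscale * 8 (shl by 3).
static_assert(indexStep(1, 4, 2) == 8, "float case with vscale = 1");
```

Assuming vscale is a power of two within the `vscale_range(1,16)` attribute, the trip count of 256 is a multiple of the step in both cases, which matches the middle block branching unconditionally to `for.end`.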