This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Use an interleave count of 1 when using a vector library call
Needs Review · Public

Authored by rob.lougher on Jun 14 2018, 1:14 PM.

Details

Reviewers

mkuper
hfinkel
mssimpso
RKSimon

Summary

Given the following test program:

#include <math.h>

void test(float *a, float *b, int n) {
  for(int i = 0; i < n; i++)
    b[i] = sinf(a[i]);
}

If we tell the compiler we have a vector-library available and compile it as follows:

$ clang -O2 --target=x86_64-unknown-linux -march=btver2 -mllvm -vector-library=SVML -S test.c

The loop will be vectorized with a vectorization factor of 8, and the call to sinf will be widened to a vector library call (__svml_sinf8):

.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%r12,%r13,4), %ymm0
	vmovups	32(%r12,%r13,4), %ymm1
	vmovups	64(%r12,%r13,4), %ymm3
	vmovups	96(%r12,%r13,4), %ymm2
	vmovups	%ymm1, (%rsp)           # 32-byte Spill
	vmovups	%ymm3, 32(%rsp)         # 32-byte Spill
	vmovups	%ymm2, 96(%rsp)         # 32-byte Spill
	callq	__svml_sinf8
	vmovups	%ymm0, 64(%rsp)         # 32-byte Spill
	vmovups	(%rsp), %ymm0           # 32-byte Reload
	callq	__svml_sinf8
	vmovups	%ymm0, (%rsp)           # 32-byte Spill
	vmovups	32(%rsp), %ymm0         # 32-byte Reload
	callq	__svml_sinf8
	vmovups	%ymm0, 32(%rsp)         # 32-byte Spill
	vmovups	96(%rsp), %ymm0         # 32-byte Reload
	callq	__svml_sinf8
	vmovups	64(%rsp), %ymm1         # 32-byte Reload
	vmovups	(%rsp), %ymm3           # 32-byte Reload
	vmovups	32(%rsp), %ymm2         # 32-byte Reload
	vmovups	%ymm1, (%r14,%r13,4)
	vmovups	%ymm3, 32(%r14,%r13,4)
	vmovups	%ymm2, 64(%r14,%r13,4)
	vmovups	%ymm0, 96(%r14,%r13,4)
	addq	$32, %r13
	cmpq	%r13, %rbx
	jne	.LBB0_6

However, the generated code is poor: it contains a large number of spills and reloads. The reason is that the loop vectorizer has chosen an interleave count (aka unroll factor) of 4.

In general, the heuristic tries to create parallel instances of the loop to expose ILP without causing spilling. It bases this on the number of registers used in the loop and the number of registers available. However, because of the way the instances are interleaved, the vector call causes the registers holding values for the other instances to be spilled (thus defeating the heuristic).
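To make the heuristic described above concrete, here is a minimal sketch. It assumes a simplified model in which the interleave count is the largest power of two such that every instance's live values fit in the register file. The function names, the cap parameter, and the numbers in the note below are illustrative assumptions, not the code in LoopVectorize.cpp.

```cpp
// Simplified model of a register-pressure-based interleave heuristic.
// All names are hypothetical; this is not the LLVM implementation.

// Largest power of two that is <= N (returns 1 for N <= 1).
constexpr unsigned powerOf2Floor(unsigned N) {
  unsigned P = 1;
  while (P <= N / 2)
    P *= 2;
  return P;
}

// Pick an interleave count so that IC copies of the loop body's live values
// fit in the registers left over after loop-invariant values are placed.
// Note that the model has no notion of a call clobbering registers.
constexpr unsigned estimateInterleaveCount(unsigned NumTargetRegs,
                                           unsigned LoopInvariantRegs,
                                           unsigned MaxLiveValues,
                                           unsigned MaxIC) {
  if (MaxLiveValues == 0 || MaxIC == 0 || NumTargetRegs <= LoopInvariantRegs)
    return 1;
  unsigned IC =
      powerOf2Floor((NumTargetRegs - LoopInvariantRegs) / MaxLiveValues);
  return IC < MaxIC ? IC : MaxIC;
}
```

Under these assumptions, estimateInterleaveCount(16, 0, 1, 4) yields 4: sixteen registers comfortably hold four copies of a single live value, and nothing in the model accounts for the call forcing those values out of their registers.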

This patch changes the heuristic to use an interleave count of 1 when a call will be vectorized to a library call. The test above now generates:

.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovups	(%r12,%r13,4), %ymm0
	callq	__svml_sinf8
	vmovups	%ymm0, (%r14,%r13,4)
	addq	$8, %r13
	cmpq	%r13, %rbx
	jne	.LBB0_6

Diff Detail

Event Timeline

rob.lougher created this revision. Jun 14 2018, 1:14 PM

Herald added a subscriber: dmgreen. Jun 14 2018, 1:14 PM

Hi Robert,

thanks for bringing this up! This approach is blindly setting the interleave factor to 1 when there are vector math function calls. I have the following questions/comments:

Maybe I'm missing something but, wouldn't the same problem happen when the function calls are scalar or for any arbitrary function call (not necessarily math functions)? Why should we do this for vector math function calls only?
I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving outweighs the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

Thanks,
Diego

In D48193#1132940, @dcaballe wrote:

Hi Robert,

thanks for bringing this up! This approach is blindly setting the interleave factor to 1 when there are vector math function calls. I have the following questions/comments:

Thanks for responding. Your questions are ones which I considered while doing the patch, so I suspected I would be asked them...

Maybe I'm missing something but, wouldn't the same problem happen when the function calls are scalar or for any arbitrary function call (not necessarily math functions)? Why should we do this for vector math function calls only?

Legalization doesn't allow arbitrary function calls (it must be an intrinsic or a library call). But yes, a vector call may be scalarized, or a widened intrinsic may be lowered back to scalar library calls. But in this case we're still going to be spilling/reloading all over the place with an IC of 1. The point of doing it for vector library calls is that the code currently generated for them is poor, and we can fix that with a simple change.

I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving outweighs the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

From what I can see the loop vectorizer is conservative, and its model of the target is very basic (see the register usage calculation, and assumptions such as the number of load/store ports being the max interleave count). Trying to add ABI considerations and spill cost calculation at the level of the loop vectorizer will be difficult. In the case of vector library calls we can clearly see a codegen issue, and setting IC=1 in this case is conservative.

Thanks,
Diego

In D48193#1133031, @rob.lougher wrote:

In D48193#1132940, @dcaballe wrote:

Hi Robert,

I'm concerned about this change introducing performance regressions. For example, imagine a loop body where the total gain of interleaving outweighs the penalty of the register spilling caused by the function call. Wouldn't it be better to properly model this particular register spilling penalty in the context of function calls instead of blindly disabling interleaving for those cases?

From what I can see the loop vectorizer is conservative, and its model of the target is very basic (see the register usage calculation, and assumptions such as the number of load/store ports being the max interleave count). Trying to add ABI considerations and spill cost calculation at the level of the loop vectorizer will be difficult. In the case of vector library calls we can clearly see a codegen issue, and setting IC=1 in this case is conservative.

I forgot to say that in the case of loops without reductions, only small loops are interleaved (i.e. loops with a cost of less than 20). For reference, the loop above has (if I remember correctly; I'm not at work now) a cost of 12 (a call cost of 10 plus 1 each for the load and store). So it's unlikely that a small loop could overcome the cost of spilling, as there's not a lot of room left for extra instructions.
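The arithmetic behind that estimate can be written out as a short sketch. The threshold of 20 and the per-instruction costs are the figures quoted above (given from memory, so treat them as approximate). The constant and function names are hypothetical.

```cpp
// Threshold quoted in the comment above: loops without reductions are only
// interleaved when their estimated cost is below this value (assumed).
constexpr unsigned SmallLoopCostThreshold = 20;

// Sum of the per-instruction costs in the loop body.
constexpr unsigned loopCost(unsigned CallCost, unsigned LoadCost,
                            unsigned StoreCost) {
  return CallCost + LoadCost + StoreCost;
}

// True when the loop is small enough to be considered for interleaving.
constexpr bool isInterleavingCandidate(unsigned Cost) {
  return Cost < SmallLoopCostThreshold;
}

// Cost units that could still be added before the loop stops qualifying.
constexpr unsigned remainingBudget(unsigned Cost) {
  return Cost < SmallLoopCostThreshold ? SmallLoopCostThreshold - 1 - Cost : 0;
}
```

With a call cost of 10 and a cost of 1 each for the load and store, loopCost gives 12 and remainingBudget gives 7, which is the limited room for extra instructions referred to above.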

Thanks again for the questions...

Thanks,
Diego

rob.lougher added subscribers: gbedwell, andreadb, RKSimon. Jun 14 2018, 3:42 PM

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this as a general register pressure issue.

The register allocator is very constrained in its ability to rematerialize the loads to avoid the spill and reload. It would require the pointers for the address computation to be available on the other side of the call, which may itself require spills/reloads of those registers.

In D48193#1133065, @hsaito wrote:

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this as a general register pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.
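A small sketch of why the spill count grows with the interleave count. It assumes (as the first listing in the summary suggests) that no vector register survives the call, that all loads are hoisted above the first call, and that all stores are sunk below the last one. The function names are hypothetical.

```cpp
// Count the spills per vector-loop iteration for IC interleaved instances of
// "load; call; store", assuming the call preserves no vector registers.
constexpr unsigned spillsPerIteration(unsigned IC) {
  unsigned Spills = 0;
  for (unsigned Call = 0; Call < IC; ++Call) {
    // Before the first call, the inputs of all other instances are live in
    // registers and must be saved to the stack.
    if (Call == 0)
      Spills += IC - 1;
    // The result of every call except the last is live across the next call,
    // so it has to be saved as well.
    if (Call + 1 < IC)
      Spills += 1;
  }
  return Spills;
}

// Every spilled value is reloaded exactly once: inputs before their own call,
// results before the stores at the end of the iteration.
constexpr unsigned reloadsPerIteration(unsigned IC) {
  return spillsPerIteration(IC);
}
```

For an interleave count of 4 this gives 6 spills and 6 reloads per iteration, which matches the "32-byte Spill" and "32-byte Reload" annotations in the first listing of the summary. For an interleave count of 1 it gives none.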

In D48193#1133083, @rob.lougher wrote:

In D48193#1133065, @hsaito wrote:

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this as a general register pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.

Craig Topper told me that LLVM currently doesn't have any special knowledge of SVML's register usage. I just checked ICC's behavior: ICC uses reg-reg moves. So there appears to be room for improvement in that area.
I think it's better to look into that first. Doing something blindly like this looks easy, but I hate to see unnecessary restrictions imposed. Even if we don't have great accuracy here, we should still try to do some accounting, especially if the calls are to something better behaved than the worst case.

What's the value of Legal checking MayHaveVectorLibCall()? Cost model can tell whether the call is vector or scalar on its own for each VF, and that's already happened by the time selectInterleaveCount() is called.

In D48193#1133104, @hsaito wrote:

In D48193#1133083, @rob.lougher wrote:

In D48193#1133065, @hsaito wrote:

I see this as a register allocator problem. It's not like we are running out of registers so that we cannot use ymm0 as a "scratch register" for SVML call. We should show the ASM code to CG experts and get the problem fixed there.

Assuming that the register allocator can fix simple enough issues... I don't think it's correct to model this as a VECLIB call problem. It can happen with any function call that uses too many registers that can't be shared among interleaved calls. Instead of looking at whether it's a VECLIB call or not, we should be checking how many registers are used for call/return, and how many of them cannot be shared among interleaved calls. We can then formulate this as a general register pressure issue.

The call uses one register but causes all other live values at the call location to be spilled (the other live values are caused by the interleaving). To model this we would need to know the calling-convention of the target (which registers are preserved), and also which registers the live values will end up in. This isn't known until after register allocation.

Craig Topper told me that LLVM currently doesn't have any special knowledge of SVML's register usage. I just checked ICC's behavior: ICC uses reg-reg moves. So there appears to be room for improvement in that area.
I think it's better to look into that first. Doing something blindly like this looks easy, but I hate to see unnecessary restrictions imposed. Even if we don't have great accuracy here, we should still try to do some accounting, especially if the calls are to something better behaved than the worst case.

What's the value of Legal checking MayHaveVectorLibCall()? Cost model can tell whether the call is vector or scalar on its own for each VF, and that's already happened by the time selectInterleaveCount() is called.

I'm not an expert on the loop vectorizer. From what I can see it checks to see if a vector lib call exists in 4 places:

Legalization
Planning stage
Cost model
Plan execution

Legalization just checks if we can vectorize the call (vector lib or intrinsic). We don't know at this stage if the call will use the vector lib as it depends on the vectorization factor.

The planning stage tries various permutations of vectorization factor ranges. Again, the planning decision is whether the call can be vectorized, which means it could be either an intrinsic or a vector lib call (within a VF range, one VF might use the intrinsic and one the vector lib, which makes it difficult to record).

The cost model is then used to calculate the expected loop cost for each VF within the VF range from planning.

At this stage the VF is then chosen (the cheapest cost).

Until this point we have been dealing with several possible VFs. It would be possible to store within the cost model whether the vector lib call was to be used against each queried VF, but the cost model looks relatively stateless (I may be wrong but that was my impression).

So I decided to add an extra pass after the VF is chosen that looks for calls within the loop that will be vectorized with a vector library call (given the chosen VF).

Although the pass is not particularly expensive, I added MayHaveVectorLibCall() to the legalization phase. This records whether a call has been seen for which a vector library call is available. As stated above, we can't tell at that point whether it will be used, but the flag guards the extra pass added after the VF is chosen. For most cases this is sufficient to avoid running the pass.

This patch tries to improve the SVML calling convention. https://reviews.llvm.org/D47188 Maybe it will help this code?

In D48193#1134868, @craig.topper wrote:

This patch tries to improve the SVML calling convention. https://reviews.llvm.org/D47188 Maybe it will help this code?

Hi Craig,

Thanks for the link.

I had an idea for an alternative approach before sending the patch. This involved changing the register usage calculation to record the number of live values at the point of a call. Then, if we knew how many vector registers are preserved across the call, we could estimate register pressure, and potentially allow interleaving. For example, if there was 1 live value at the call, and 4 registers are preserved, we could allow an IC of 4 (1*4 means there would be 4 live registers after interleaving minus 1 for the value dead after the call).
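One possible reading of that estimate, as a sketch only. It assumes that the interleave count is restricted to powers of two, that interleaving multiplies the number of values live at the call by the interleave count, and that one of those values is consumed by the call itself and so need not survive it. The function name and the MaxIC cap are hypothetical.

```cpp
// Largest power-of-two interleave count for which the values that must
// survive a call still fit in the registers preserved across it.
constexpr unsigned maxInterleaveForCall(unsigned LiveAtCall,
                                        unsigned PreservedRegs,
                                        unsigned MaxIC) {
  unsigned IC = 1;
  // LiveAtCall * IC values are live at the call after interleaving; one is
  // dead after the call, so LiveAtCall * IC - 1 must fit in PreservedRegs.
  // The comparison is rearranged to avoid unsigned underflow.
  while (IC * 2 <= MaxIC && LiveAtCall * (IC * 2) <= PreservedRegs + 1)
    IC *= 2;
  return IC;
}
```

With 1 live value, 4 preserved registers, and a cap of 8, this returns 4, as in the example above. With no preserved registers it returns 1, which is the behaviour the current patch hard-codes for vector library calls.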

The problem was finding out how many registers are preserved. The TargetTransformInfo pass exposes codegen information to IR-level passes. So for example, it provides to the loop vectorizer the number of registers (scalar or vector). However, this is just a simple target specific number. In contrast, the number of preserved registers depends on the calling-convention/word-size/instruction-set, etc. A vector-library call could use any calling-convention, and as far as I can see, there's nothing to prevent anybody from providing an SVML-like library on say, ARM, so this would also need to be implemented for all targets.

Of course, this sort of information is needed by the register allocator. TargetRegisterInfo provides an interface to find out information about the target registers, e.g. getCalleeSavedRegs() and getCallPreservedMask(). The TargetRegisterInfo is normally obtained via the subtarget attached to the machine function (this subtarget is created during codegen prepare). However, the TargetTransformInfo also has a subtarget object (as part of the TTIImpl), which means the TargetRegisterInfo could potentially be queried by the loop-vectorizer. However, both getCalleeSavedRegs() and getCallPreservedMask() take a MachineFunction pointer (which obviously doesn't exist when the loop-vectorizer is run). The functions are also much more low-level than we require (we would need to convert the return value into a number, based on register class, etc.).

D47188 is interesting for two reasons. Firstly it provides an explicit calling-convention for SVML. Secondly, only an X86 implementation of the calling-convention is provided. So, if this were to land, implementing the number of preserved registers becomes trivial. However, on re-reading the previous comments, there's a reluctance to only handle vector-library calls (unfortunately handling all calls is problematic, as an intrinsic call may end up as a sequence of instructions, or it may be lowered back into scalar calls to libm). Also, is it true that the SVML framework is X86-only?

So I'm still rather stuck as to what to do for the preserved registers. I don't want to go to the fuss of providing a full implementation for every target (duplicating the logic in getCalleeSavedRegs/getCallPreservedMask) if it isn't necessary. If people think the approach outlined above sounds promising, I could provide an initial patch that just handled the default CCs on X86?

Thanks,
Rob.

RKSimon added a reviewer: RKSimon. Aug 14 2018, 10:06 AM

rob.lougher mentioned this in D50798: [LoopVectorizer] Take into account call register pressure when selecting interleave count. Aug 15 2018, 12:41 PM

I've created a follow-up review, D50798.

Revision Contents

Path                                                            Size
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h   8 lines
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp          49 lines
lib/Transforms/Vectorize/LoopVectorize.cpp                      28 lines
test/Transforms/LoopVectorize/X86/interleaving-veclib-call.ll   101 lines

Diff 151397

include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

[...] public:
   bool isMaskRequired(const Instruction *I) { return (MaskedOp.count(I) != 0); }

   unsigned getNumStores() const { return LAI->getNumStores(); }
   unsigned getNumLoads() const { return LAI->getNumLoads(); }

   // Returns true if the NoNaN attribute is set on the function.
   bool hasFunNoNaNAttr() const { return HasFunNoNaNAttr; }

+  // Returns true if the loop contains a call that may be vectorized
+  // with a vector version of the library call.
+  bool mayHaveVectorLibCall() const { return MayHaveVectorLibCall; }
+
 private:
   /// Return true if the pre-header, exiting and latch blocks of \p Lp and all
   /// its nested loops are considered legal for vectorization. These legal
   /// checks are common for inner and outer loop vectorization.
   /// Temporarily taking UseVPlanNativePath parameter. If true, take
   /// the new code path being implemented for outer loop vectorization
   /// (should be functional for inner loop vectorization) based on VPlan.
   /// If false, good old LV code.
[...] private:

   /// The assumption cache analysis is used to compute the minimum type size in
   /// which a reduction can be computed.
   AssumptionCache *AC;

   /// While vectorizing these instructions we have to generate a
   /// call to the appropriate masked intrinsic
   SmallPtrSet<const Instruction *, 8> MaskedOp;
+
+  // Does the loop contain a call that may be vectorized with a vector version
+  // of the library call.
+  bool MayHaveVectorLibCall = false;
 };

 } // namespace llvm

 #endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

[...] for (Instruction &I : *BB) {
         return false;
       } // end of PHI handling

       // We handle calls that:
       //   * Are debug info intrinsics.
       //   * Have a mapping to an IR intrinsic.
       //   * Have a vector version available.
       auto *CI = dyn_cast<CallInst>(&I);
-      if (CI && !getVectorIntrinsicIDForCall(CI, TLI) &&
-          !isa<DbgInfoIntrinsic>(CI) &&
-          !(CI->getCalledFunction() && TLI &&
-            TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {
+      if (CI) {
+        Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
+        bool VectorAvail = CI->getCalledFunction() && TLI &&
+            TLI->isFunctionVectorizable(CI->getCalledFunction()->getName());
+        if (!ID && !VectorAvail && !isa<DbgInfoIntrinsic>(CI)) {
           ORE->emit(createMissedAnalysis("CantVectorizeCall", CI)
                     << "call instruction cannot be vectorized");
           LLVM_DEBUG(
               dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");
           return false;
         }

         // Intrinsics such as powi,cttz and ctlz are legal to vectorize if the
         // second argument is the same (i.e. loop invariant)
-      if (CI && hasVectorInstrinsicScalarOpd(
-                    getVectorIntrinsicIDForCall(CI, TLI), 1)) {
+        if (hasVectorInstrinsicScalarOpd(ID, 1)) {
           auto *SE = PSE.getSE();
           if (!SE->isLoopInvariant(PSE.getSCEV(CI->getOperand(1)), TheLoop)) {
             ORE->emit(createMissedAnalysis("CantVectorizeIntrinsic", CI)
                       << "intrinsic instruction cannot be vectorized");
             LLVM_DEBUG(dbgs()
                        << "LV: Found unvectorizable intrinsic " << *CI << "\n");
             return false;
           }
         }
+
+        // If a vector library call is available, we can only say the vectorized
+        // loop "may" contain a call to it, as the decision depends on the
+        // selected vectorization factor.
+        if (VectorAvail)
+          MayHaveVectorLibCall = true;
+      }

       // Check that the instruction return type is vectorizable.
       // Also, we can't vectorize extractelement instructions.
       if ((!VectorType::isValidElementType(I.getType()) &&
            !I.getType()->isVoidTy()) ||
           isa<ExtractElementInst>(I)) {
         ORE->emit(createMissedAnalysis("CantVectorizeInstructionReturnType", &I)
                   << "instruction return type cannot be vectorized");
         LLVM_DEBUG(dbgs() << "LV: Found unvectorizable type.\n");
[...]

lib/Transforms/Vectorize/LoopVectorize.cpp

[...] private:
   /// Returns whether the instruction is a load or store and will be a emitted
   /// as a vector operation.
   bool isConsecutiveLoadOrStore(Instruction *I);

   /// Returns true if an artificially high cost for emulated masked memrefs
   /// should be used.
   bool useEmulatedMaskMemRefHack(Instruction *I);

+  // Returns true if the loop vectorized with a factor of \p VF would contain a
+  // call to a vector library function.
+  bool containsVectorLibCall(unsigned VF);
+
   /// Create an analysis remark that explains why vectorization failed
   ///
   /// \p RemarkName is the identifier for the remark. \return the remark object
   /// that can be streamed to.
   OptimizationRemarkAnalysis createMissedAnalysis(StringRef RemarkName) {
     return createLVMissedAnalysis(Hints->vectorizeAnalysisPassName(),
                                   RemarkName, TheLoop);
   }
[...] unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
   if (Legal->getMaxSafeDepDistBytes() != -1U)
     return 1;

   // Do not interleave loops with a relatively small trip count.
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
   if (TC > 1 && TC < TinyTripCountInterleaveThreshold)
     return 1;

+  // Do not interleave if the vectorized loop will contain a call to a vector
+  // library function, as the function call will cause the registers for
+  // the parallel instances to be spilled.
+  if (Legal->mayHaveVectorLibCall() && containsVectorLibCall(VF))
+    return 1;
+
   unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1);
   LLVM_DEBUG(dbgs() << "LV: The target has " << TargetNumRegisters
                     << " registers\n");

   if (VF == 1) {
     if (ForceTargetNumScalarRegs.getNumOccurrences() > 0)
       TargetNumRegisters = ForceTargetNumScalarRegs;
   } else {
[...] while (!Worklist.empty()) {
     // of the instruction costs more, and scalarizing would be beneficial.
     Discount += VectorCost - ScalarCost;
     ScalarCosts[I] = ScalarCost;
   }

   return Discount;
 }

+bool LoopVectorizationCostModel::containsVectorLibCall(unsigned VF) {
+  // Given a vectorization factor VF, this function looks for calls that would
+  // be vectorized with a vector version of the library call.
+  for (BasicBlock *BB : TheLoop->blocks())
+    for (Instruction &I : BB->instructionsWithoutDebug())
+      if (auto *CI = dyn_cast<CallInst>(&I)) {
+        bool NeedToScalarize;
+        unsigned CallCost =
+            getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);
+        Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
+        if (!NeedToScalarize &&
+            (!ID || getVectorIntrinsicCost(CI, VF, TTI, TLI) > CallCost))
+          return true;
+      }
+
+  return false;
+}
+
 LoopVectorizationCostModel::VectorizationCostTy
 LoopVectorizationCostModel::expectedCost(unsigned VF) {
   VectorizationCostTy Cost;

   // For each block.
   for (BasicBlock *BB : TheLoop->blocks()) {
     VectorizationCostTy BlockCost;

[...]

test/Transforms/LoopVectorize/X86/interleaving-veclib-call.ll

				; RUN: opt -S -mtriple=x86_64-unknown-linux -mcpu=btver2 -vector-library=SVML -loop-vectorize < %s | FileCheck %s

				; This test checks that when a call is vectorized with a vector library call
				; the interleave count used is 1 (i.e. the call appears only once). As loops
				; with reductions are treated specially by the cost model we also test this
				; case. Finally, a test is included that checks that no restriction of the
				; interleave count is done when the call is not vectorized to a vector library
				; call.

				; CHECK-LABEL: test
				; CHECK: call <8 x float> @__svml_sinf8
				; CHECK-NOT: call <8 x float> @__svml_sinf8

				define void @sinf-test(float* nocapture readonly %a, float* nocapture %b, i32 %n) {
				entry:
				%cmp7 = icmp sgt i32 %n, 0
				br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%call = tail call float @sinf(float %0)
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %indvars.iv
				store float %call, float* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				declare dso_local float @sinf(float) local_unnamed_addr

				; CHECK-LABEL: sinf-reduc-test
				; CHECK: call fast <8 x float> @__svml_sinf8
				; CHECK-NOT: call fast <8 x float> @__svml_sinf8

				define float @sinf-reduc-test(float* nocapture readonly %a, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				%s.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				ret float %s.0.lcssa

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%s.07 = phi float [ 0.000000e+00, %for.body.preheader ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%1 = tail call fast float @llvm.sin.f32(float %0)
				%add = fadd fast float %1, %s.07
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				declare float @llvm.sin.f32(float)

				; CHECK-LABEL: ceilf-test
				; CHECK: call <8 x float> @llvm.ceil.v8f32
				; CHECK: call <8 x float> @llvm.ceil.v8f32
				; CHECK: call <8 x float> @llvm.ceil.v8f32
				; CHECK: call <8 x float> @llvm.ceil.v8f32

				define void @ceilf-test(float* nocapture readonly %a, float* nocapture %b, i32 %n) {
				entry:
				%cmp7 = icmp sgt i32 %n, 0
				br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%1 = tail call float @llvm.ceil.f32(float %0)
				%arrayidx2 = getelementptr inbounds float, float* %b, i64 %indvars.iv
				store float %1, float* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				declare float @llvm.ceil.f32(float)
