This is an archive of the discontinued LLVM Phabricator instance.

Improve LoopVectorizer's estimation of scalar loops by predicting LSR behaviour
Needs Review · Public

Authored by jonpa on Apr 18 2017, 7:38 AM.

Details

Summary

I have experimented with collecting a set of instructions that LSR is likely to eliminate in the scalar version of the loop, for example an add of a PHI and a constant.

At this point, the patch is showing a general improvement of at least 5% in the scalar estimates. collectLikelyLSRed() can likely be improved further.
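
For readers of the archive, here is a minimal sketch, under my own assumptions, of what a collectLikelyLSRed()-style walk matching "an add of a PHI and a constant" could look like in LLVM C++; it is illustrative only and not the contents of the actual patch.

// Sketch only: collect (add phi, constant) instructions in the loop as
// candidates that LSR would likely fold into addressing or rewrite away
// in the scalar loop. The real patch may match more or different patterns.
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

static void collectLikelyLSRed(Loop *L,
                               SmallPtrSetImpl<Instruction *> &Likely) {
  for (BasicBlock *BB : L->blocks())
    for (Instruction &I : *BB) {
      auto *BO = dyn_cast<BinaryOperator>(&I);
      if (!BO || BO->getOpcode() != Instruction::Add)
        continue;
      // Match (add phi, const) in either operand order, with the PHI
      // defined inside the loop (an induction-style update).
      Value *Op0 = BO->getOperand(0), *Op1 = BO->getOperand(1);
      auto *Phi = dyn_cast<PHINode>(Op0);
      Value *Other = Op1;
      if (!Phi) {
        Phi = dyn_cast<PHINode>(Op1);
        Other = Op0;
      }
      if (Phi && L->contains(Phi->getParent()) && isa<ConstantInt>(Other))
        Likely.insert(BO);
    }
}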

For now, I mainly wonder what the response to this is: is it a good or a bad idea?

How should we weigh a general improvement in the estimates against avoiding worst-case degradations?

Diff Detail

Event Timeline

jonpa created this revision. Apr 18 2017, 7:38 AM
rengolin edited edge metadata. May 2 2017, 12:36 PM

Hi Jonas,

As much as this helps, I think duplicating the LSR logic here, even if a very limited subset of it, is problematic.

If assumptions change with time, you may start having increasingly different results and not detect them until the more pathological cases start to explode.

In cases like this, in the past, we discussed splitting the analysis pass from the execution pass on the targeted transformation. IIRC, this is how the loop analysis passes were born.

Maybe, if you move the LSR analysis code to the generic loop pass side, and both LSR and LV use the same analysis, you can work around the issue without duplicating anything.

Makes sense?

cheers,
--renato

jonpa added a comment. May 3 2017, 11:16 PM

> Hi Jonas,
>
> As much as this helps, I think duplicating the LSR logic here, even if a very limited subset of it, is problematic.
>
> If assumptions change with time, you may start having increasingly different results and not detect them until the more pathological cases start to explode.
>
> In cases like this, in the past, we discussed splitting the analysis pass from the execution pass on the targeted transformation. IIRC, this is how the loop analysis passes were born.
>
> Maybe, if you move the LSR analysis code to the generic loop pass side, and both LSR and LV use the same analysis, you can work around the issue without duplicating anything.
>
> Makes sense?
>
> cheers,
> --renato

Hi Renato,

thanks for your opinion!

Ideally, LSR should handle vectorized addresses as well, and if it could, there would not be this difference against the scalar loops. It currently can't, however, and that's why this patch improves the vectorizer's decisions by giving the scalar loop a cheaper cost in these cases.

Is LSR going to do this anytime soon? Or is it an option to run LSR (also?) before the Loop vectorizer? I have seen many cases where the scalar loop is much smaller just because the vectorizer generates a vector add, vector shift and what not of the address, while LSR removes those instructions entirely in the scalar loop. In this sense, isn't the loop vectorizer run too early?

I am not sure it's worth the effort to factor out the LSR analysis just to learn how much better it does on scalar loops... It would be better to fix the vectorized loops instead. So if this patch is used, it should hopefully just be a temporary measure...

sanjoy added a subscriber: sanjoy. May 7 2017, 2:25 PM

> In this sense, isn't the loop vectorizer run too early?

That's a good question, one that may be answered by adding it just before the loop vectoriser and seeing how the compile-time increase compares to the run-time performance benefits on standard benchmarks.

I'm adding some more people to chime in, just in case they have seen this before and have a plan we're not aware of. :)

cheers,
--renato

jonpa added a comment. May 31 2017, 1:21 AM

>> In this sense, isn't the loop vectorizer run too early?
>
> That's a good question, one that may be answered by adding it just before the loop vectoriser and seeing how the compile-time increase compares to the run-time performance benefits on standard benchmarks.

I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.
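
(A minimal sketch of what such an experiment could look like with the legacy pass manager in use at the time; this is an assumption for illustration, not the exact change that was tried.)

// Assumed context: inside PassManagerBuilder::populateModulePassManager()
// (legacy pass manager); createLoopStrengthReducePass() is declared in
// llvm/Transforms/Scalar.h and createLoopVectorizePass() in
// llvm/Transforms/Vectorize.h.
MPM.add(createLoopStrengthReducePass()); // experimental early LSR run
MPM.add(createLoopVectorizePass());      // existing vectorizer run follows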

> I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.

For both compile time and run time performance?

jonpa added a comment. May 31 2017, 2:20 AM

>> I tried this - adding LSR just before loop vectorizer, and found that it did indeed affect a few benchmarks noticeably, with mixed results.
>
> For both compile time and run time performance?

I have only looked at run-time performance. It has been a while since I tried to evaluate the compile-time impact of a single pass. How would you do this?

> I have only looked at run-time performance. It has been a while since I tried to evaluate the compile-time impact of a single pass. How would you do this?

Well, if the whole compile time doesn't increase noticeably on a good number of programs and benchmarks (I'm expecting it won't), then it should be mostly fine.

One way to track compile time is to use LNT. Put up a server, run vanilla, submit the results, run with the change, submit the results, compare. (http://llvm.org/docs/lnt/quickstart.html)

If the results are mostly fine, then it'd just be a matter of understanding why it was where it was and making sure everyone agrees with the move (or addition).

I'm adding a few people who have recently worked on LSR, as well as some vectoriser people.

I have seen a few comments in the LV that expect LSR to run after it, so it might not be completely advisable to *move* it, but running it both before and after might have a noticeable impact.

cheers,
--renato

sanjoy resigned from this revision. Jan 29 2022, 5:34 PM