This is an archive of the discontinued LLVM Phabricator instance.

Optimize unrolled reductions in LoopStrengthReduce
Needs Review, Public

Authored by ohsallen on Jan 22 2015, 9:34 AM.

Summary

Break dependencies between unrolled iterations of reductions in loops. This should be particularly effective for superscalar targets. For a kernel similar to the one below, we get a 2.5x speedup on POWER8 when the unroll factor is 3.

// Original reduction.
for (int i = 0; i < n; ++i)
    r += arr[i];

// Unrolled reduction.
for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r += arr[i+1];
}

// Optimized reduction
float r_0 = 0;
for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r_0 += arr[i+1];
}
r += r_0;
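
A self-contained sketch of the same idea, for illustration only: the function names (`sum_serial`, `sum_split`) and the scalar tail for an odd element count are assumptions and are not part of the patch. Integer data is used so that both versions must return identical results; for floating point, splitting the accumulator reorders the additions, which is why the patch only handles FP operations that allow unsafe algebra.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Serial reduction: every addition depends on the previous one.
std::int64_t sum_serial(const std::vector<std::int64_t> &arr) {
  std::int64_t r = 0;
  for (std::size_t i = 0; i < arr.size(); ++i)
    r += arr[i];
  return r;
}

// Unrolled by two with a second accumulator (hypothetical helper): the two
// additions in the loop body no longer depend on each other.
std::int64_t sum_split(const std::vector<std::int64_t> &arr) {
  std::int64_t r = 0, r_0 = 0;
  std::size_t i = 0;
  for (; i + 1 < arr.size(); i += 2) {
    r += arr[i];
    r_0 += arr[i + 1];
  }
  // Scalar tail for an odd element count (an assumption; the snippets above
  // leave this case out).
  if (i < arr.size())
    r += arr[i];
  return r + r_0;
}
```

The final `r + r_0` corresponds to the `r += r_0;` statement placed after the loop in the summary.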

Event Timeline

ohsallen updated this revision to Diff 18617. Jan 22 2015, 9:34 AM

ohsallen retitled this revision to Optimize unrolled reductions in LoopStrengthReduce.

ohsallen updated this object.

ohsallen edited the test plan for this revision.

ohsallen added a reviewer: hfinkel.

ohsallen added a subscriber: Unknown Object (MLST).

Hi Olivier,

AFAIK, most, if not all, heuristics in LSR are careful not to increase register pressure. You can see that, in particular, when we rate the formulae.
On the other hand, your optimization increases the register pressure for the whole loop. This may not be a good idea in general.

Ultimately, I would like us to have some kind of register pressure estimation to decide whether or not we should perform the transformation for a given loop.
Short term, it should, in my opinion, be at least parameterized on the target, i.e., we should have a target hook to decide whether or not we want to perform this optimization.

Side question: how does this patch affect performance on the llvm test-suite?

Thanks,
-Quentin

Quentin,

Thanks for the feedback. This makes sense and I agree. When the unrolling factor is N, N-1 additional registers are live in the loop range, so typically we could have some limit (depending on the target and/or register pressure as you suggest) and partially apply the optimization on the loop in some cases.
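
One way to read this proposal is as a cap on the number of accumulators. The sketch below is only an illustration: `TargetLimits`, `maxExtraAccumulators`, and `assignAccumulators` are hypothetical names, not LLVM APIs, and the round-robin assignment is an assumption about how a partial application might look.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a target hook; not an existing LLVM interface.
struct TargetLimits {
  unsigned maxExtraAccumulators; // extra live registers the target tolerates
};

// For a reduction chain of `chainLength` operations, return the accumulator
// index each operation would use. With unroll factor N and no effective cap
// this yields N accumulators (N-1 extra). With a cap, operations share
// accumulators round-robin, so dependencies are only partially broken.
std::vector<unsigned> assignAccumulators(std::size_t chainLength,
                                         const TargetLimits &limits) {
  std::vector<unsigned> assignment;
  if (chainLength == 0)
    return assignment;
  std::size_t numAccs = std::min<std::size_t>(
      chainLength, static_cast<std::size_t>(limits.maxExtraAccumulators) + 1);
  assignment.reserve(chainLength);
  for (std::size_t i = 0; i < chainLength; ++i)
    assignment.push_back(static_cast<unsigned>(i % numAccs));
  return assignment;
}
```

With a limit of zero extra accumulators, every operation maps to accumulator 0 and the chain stays fully serial, which matches not applying the optimization at all.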

I'll look into the performance of the test-suite.

Olivier

In D7128#112267, @qcolombet wrote:

Hi Olivier,

AFAIK, most, if not all, heuristics in LSR are careful not to increase register pressure. You can see that, in particular, when we rate the formulae.
On the other hand, your optimization increases the register pressure for the whole loop. This may not be a good idea in general.

Ultimately, I would like us to have some kind of register pressure estimation to decide whether or not we should perform the transformation for a given loop.
Short term, it should, in my opinion, be at least parameterized on the target, i.e., we should have a target hook to decide whether or not we want to perform this optimization.

Side question: how does this patch affect performance on the llvm test-suite?

Thanks,
-Quentin

I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

In D7128#112583, @hfinkel wrote:

I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

Hi Hal,

The loop vectorizer does perform a similar transformation, but it does not break dependencies between iterations of a loop that has already been unrolled. For instance, consider the following:

// Original loop.
for (int i = 0; i < n; i++) 
    for (int j = 0; j < 3; j++)
        r += arr[i][j];

// After unrolling pass.
for (int i = 0; i < n; i++)  {
    r += arr[i][0];
    r += arr[i][1];
    r += arr[i][2];
}

// After vectorization pass.
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r += arr[i][1];
    r_0 += arr[i+1][1];
    r += arr[i][2];
    r_0 += arr[i+1][2];
}
r += r_0;

// After strength reduction pass with changes.
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r_1 += arr[i][1];
    r_2 += arr[i+1][1];
    r_3 += arr[i][2];
    r_4 += arr[i+1][2];
}
r += r_0 + r_1 + r_2 + r_3 + r_4;

The interleaved unrolling in the loop vectorizer seems to be applied on top of the earlier unrolling pass. There are two separate dependency chains after vectorization, but the code runs faster on POWER8 with three chains (and potentially even faster with up to six chains). By breaking dependencies later, in strength reduction (while checking register pressure), we can achieve better performance. It's not clear to me whether the loop vectorizer can be changed to get this behavior; I'll have to investigate.
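
The effect of the number of chains can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement: `criticalPathCycles` is a hypothetical helper, and it assumes that every addition has the same latency and that the machine has enough issue width to run all chains in parallel.

```cpp
#include <cstddef>

// Rough model: `opsPerIteration` dependent additions are spread as evenly as
// possible over `chains` independent accumulators (chains must be >= 1), and
// each addition takes `latency` cycles. The loop-carried critical path per
// iteration is then the length of the longest chain times the latency.
constexpr std::size_t criticalPathCycles(std::size_t opsPerIteration,
                                         std::size_t chains,
                                         std::size_t latency) {
  // Ceiling division: the longest chain holds the leftover operation, if any.
  return ((opsPerIteration + chains - 1) / chains) * latency;
}
```

For the six additions in the loop body above and an assumed latency of 6 cycles (an illustrative figure, not a POWER8 specification), the model gives 36 cycles with one chain, 18 with two, 12 with three, and 6 with six.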

Thanks,

Olivier

As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains. Only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code that breaks dependencies and the code that performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, then reductions aren't optimized by the loop vectorizer/unroller.

To reuse the existing code in LoopVectorize.cpp (instead of my patch), we could choose in LoopUnroll.cpp to not unroll loops which contain reductions. Then the vectorizer would see the opportunity to unroll and perform the optimization. There would be a few cosmetic changes to do in the loop vectorizer, which I can detail if you think this is reasonable.

Olivier

In D7128#118028, @ohsallen wrote:

As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains. Only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code that breaks dependencies and the code that performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, then reductions aren't optimized by the loop vectorizer/unroller.

I don't understand the problem you're trying to highlight. The loop unroller is run in two places within the standard optimization pipeline. The first place is 'early', within the inliner-driven CGSCC pass manager. When run early, it does *full* unrolling only. It is also run 'late', after the loop vectorizer, when it might also do target-directed partial unrolling. But this is after the loop vectorizer runs, so there should be no conflict.

To reuse the existing code in LoopVectorize.cpp (instead of my patch), we could choose in LoopUnroll.cpp to not unroll loops which contain reductions. Then the vectorizer would see the opportunity to unroll and perform the optimization. There would be a few cosmetic changes to do in the loop vectorizer, which I can detail if you think this is reasonable.

Yes, I think that enhancing the loop vectorizer to do this is reasonable.

Olivier

Hal,

In D7128#118048, @hfinkel wrote:

I don't understand the problem you're trying to highlight. The loop unroller is run in two places within the standard optimization pipeline. The first place is 'early', within the inliner-driven CGSCC pass manager. When run early, it does *full* unrolling only. It is also run 'late', after the loop vectorizer, when it might also do target-directed partial unrolling. But this is after the loop vectorizer runs, so there should be no conflict.

Thanks for the clarification. What I meant was that it seems harder to change the vectorizer to break dependencies in loops that are already unrolled than to skip unrolling them in the 'early' pass and let the vectorizer and the 'late' unroller do their job. I'll propose a patch that implements the second solution.

Olivier

Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, it does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer unroller, and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}
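
To make the intended result concrete, here is a sketch of the kernel above after the inner loop is fully unrolled and the remaining loop is interleaved by two with separate accumulators. It is an illustration only: it uses `std::complex<double>` instead of the `_Complex` extension, the names `Row`, `norm2`, `normSerial`, and `normInterleaved` are hypothetical, and the tail handling for an odd row count is an assumption. Because floating-point addition is not associative, the two functions are only guaranteed to agree when reassociation is harmless (for example, with exactly representable values).

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Row = std::array<std::complex<double>, 3>;

// Squared magnitude, written out as in the kernel above.
inline double norm2(const std::complex<double> &z) {
  return z.real() * z.real() + z.imag() * z.imag();
}

// Original form: a single accumulator carries a dependency across every add.
double normSerial(const std::vector<Row> &a) {
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t c = 0; c < 3; ++c)
      sum += norm2(a[i][c]);
  return sum;
}

// Inner loop unrolled, outer loop interleaved by two: two independent chains.
double normInterleaved(const std::vector<Row> &a) {
  double sum = 0.0, sum_0 = 0.0;
  std::size_t i = 0;
  for (; i + 1 < a.size(); i += 2) {
    sum += norm2(a[i][0]);
    sum_0 += norm2(a[i + 1][0]);
    sum += norm2(a[i][1]);
    sum_0 += norm2(a[i + 1][1]);
    sum += norm2(a[i][2]);
    sum_0 += norm2(a[i + 1][2]);
  }
  // Tail iteration when the row count is odd (assumed; not in the original).
  if (i < a.size())
    for (std::size_t c = 0; c < 3; ++c)
      sum += norm2(a[i][c]);
  return sum + sum_0;
}
```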

Attachment: patch.diff (697 B)

In D7128#119930, @ohsallen wrote:
Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, it does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer unroller, and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.
for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}
Attachment: patch.diff (697 B)

Okay, interesting. Please propose this as a separate patch. We'll need to run tests on other platforms too (or also otherwise restrict it).

FWIW I'm not such a huge fan of using register pressure heuristics in the
middle end, it has a tendency to hide code gen deficiencies or be
suboptimal in cases where we could do better if we did the optimization and
let code generation fold things together/MachineLICM/etc. I understand
where the desires come from, but if possible I like to optimize as hard as
we can in the middle end and clean it up using machine optimizations.

Yes, I know this isn't always practical, but...

-eric

Eric,

Interesting point. I could eventually investigate that if I were to apply the current patch to LoopStrengthReduce. As Hal and I decided to use the existing functionality in the loop vectorizer, we will rely on the existing heuristics.

Thanks,
Olivier

FWIW I'm not such a huge fan of using register pressure heuristics in the middle end

FWIW, I agree, but this is not what LSR currently does, and since the backends do not expect that yet, I would prefer to proceed with caution.

Cautious is fine, but adding more register pressure heuristics to the
middle end just gets us to a point where everything is poorly done via
heuristics.

-eric

----- Original Message -----

From: "Eric Christopher" <echristo@gmail.com>
To: ohsallen@us.ibm.com, hfinkel@anl.gov
Cc: llvm-commits@cs.uiuc.edu
Sent: Tuesday, February 10, 2015 3:22:43 PM
Subject: Re: [PATCH] Optimize unrolled reductions in LoopStrengthReduce

Cautious is fine, but adding more register pressure heuristics to the
middle end just gets us to a point where everything is poorly done via
heuristics.

FWIW, this is why I asked that this be moved to the vectorizer, where we already have a register-pressure heuristic, and I don't want to add any more of these than necessary. That is what is being worked on now.

-Hal

-eric

http://reviews.llvm.org/D7128


Revision Contents

Path                                                       Size
lib/Transforms/Scalar/LoopStrengthReduce.cpp               238 lines
test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll      2 lines
test/Transforms/LoopStrengthReduce/unrolled-reduction.ll   174 lines

Diff 18617

lib/Transforms/Scalar/LoopStrengthReduce.cpp

@@ 1,691 lines hidden @@ class LSRInstance {

   /// IVIncSet - IV users that belong to profitable IVChains.
   SmallPtrSet<Use*, MaxChains> IVIncSet;

   void OptimizeShadowIV();
   bool FindIVUserForCond(ICmpInst *Cond, IVStrideUse *&CondUse);
   ICmpInst *OptimizeMax(ICmpInst *Cond, IVStrideUse* &CondUse);
   void OptimizeLoopTermCond();
+  SmallVector<Instruction*, 8> *FindUnrolledReduction(Instruction *Inst,
+                                                      PHINode *Phi,
+                                                      bool first);
+  void TransformUnrolledReduction(SmallVector<Instruction*, 8> *Chain,
+                                  PHINode *Phi);
+  void OptimizeUnrolledReductions();

   void ChainInstruction(Instruction *UserInst, Instruction *IVOper,
                         SmallVectorImpl<ChainUsers> &ChainUsersVec);
   void FinalizeChain(IVChain &Chain);
   void CollectChains();
   void GenerateIVChain(const IVChain &Chain, SCEVExpander &Rewriter,
                        SmallVectorImpl<WeakVH> &DeadInsts);

@@ 523 lines hidden @@ BasicBlock *BB =
                                     Inst->getParent());
     if (BB == Inst->getParent())
       IVIncInsertPos = Inst;
     else if (BB != IVIncInsertPos->getParent())
       IVIncInsertPos = BB->getTerminator();
   }
 }

+/// FindUnrolledReduction - Return chain of instructions corresponding to
+/// unrolled iterations of a reduction in a loop, if any.
+SmallVector<Instruction*, 8> *
+LSRInstance::FindUnrolledReduction(Instruction *Inst, PHINode *Phi,
+                                   bool first = true) {
+
+  // Unsafe algebra would be required in order to reorder FP operations.
+  if (isa<FPMathOperator>(Inst) && !Inst->getFastMathFlags().unsafeAlgebra())
+    return NULL;
+
+  // Each value computed in the chain should be used exactly once, except for
+  // the final reduction variable which can be arbitrarily used at loop exit.
+  if (first) {
+    for (User *U : Inst->users())
+      if (isa<Instruction>(U) && U != Phi &&
+          cast<Instruction>(U)->getParent() == Inst->getParent())
+        return NULL;
+  }
+  else if (Inst->getNumUses() > 1)
+    return NULL;
+
+  for (unsigned int i = 0; i < Inst->getNumOperands(); ++i) {
+    Value *Operand = Inst->getOperand(i);
+
+    // Subtractions are not commutative, only consider first operand.
+    if ((Inst->getOpcode() == Instruction::Sub ||
+         Inst->getOpcode() == Instruction::FSub) && i > 0)
+      break;
+
+    // Walk through the dependency chain as long as the same opcode is
+    // encountered.
+    if (isa<BinaryOperator>(Operand)) {
+      BinaryOperator *BinaryOp = cast<BinaryOperator>(Operand);
+
+      if (BinaryOp->getOpcode() == Inst->getOpcode()) {
+        SmallVector<Instruction*, 8> *Chain =
+          FindUnrolledReduction(BinaryOp, Phi, false);
+
+        if (Chain) {
+          Chain->push_back(Inst);
+          return Chain;
+        }
+      }
+    }
+
+    // Specified phi node was encountered, which means we have a dependency
+    // chain.
+    else if (isa<PHINode>(Operand)) {
+      if (Phi == cast<PHINode>(Operand)) {
+        SmallVector<Instruction*, 8> *Chain =
+          new SmallVector<Instruction*, 8>();
+        Chain->push_back(Inst);
+        return Chain;
+      }
+    }
+  }
+  return NULL;
+}

+/// TransformUnrolledReduction - Break dependencies in the chain.
+void
+LSRInstance::TransformUnrolledReduction(SmallVector<Instruction*, 8> *Chain,
+                                        PHINode *Phi) {
+  assert((Chain && !Chain->empty()) && "empty chains are not allowed");
+  BasicBlock *LoopBlock = Phi->getParent();
+  BasicBlock *LoopExit = LoopBlock->getNextNode();
+
+  // Retrieve register used for reduction.
+  Value *Val = Phi->getIncomingValueForBlock(LoopBlock);
+  assert((isa<Instruction>(Val)) && "unexpected dependency chain");
+  Instruction *Red = cast<Instruction>(Val);
+
+  // Opcode-specific parameters.
+  Instruction::BinaryOps Opcode = (Instruction::BinaryOps)(Red->getOpcode());
+  int neutral = 0;
+
+  if (Opcode == Instruction::And ||
+      Opcode == Instruction::Mul ||
+      Opcode == Instruction::FMul)
+    neutral = 1;
+
+  if (Opcode == Instruction::Sub)
+    Opcode = Instruction::Add;
+  else if (Opcode == Instruction::FSub)
+    Opcode = Instruction::FAdd;
+
+  // Retrieve reduction at loop exit.
+  bool renamed = false;
+  for (User *U : Red->users()) {
+    if (isa<PHINode>(U) && cast<PHINode>(U)->getParent() == LoopExit) {
+      Red = cast<Instruction>(U);
+      renamed = true;
+      break;
+    }
+  }
+
+  // We will insert code at the loop exit to compute a reduction based on many
+  // reduction variables, one for each element in the chain (i.e. as many as
+  // the unroll factor for the loop).
+  Instruction *InsertionPt = LoopExit->getFirstNonPHI();
+  Instruction *Previous = Phi;
+  Instruction *NewRed = Red, *FirstRed = NULL;
+
+  for (Instruction *Inst : *Chain) {
+    bool isLast = (Phi->getIncomingValueForBlock(LoopBlock) == Inst);
+    Twine Name = Inst->getName();
+
+    PHINode *NewPhi;
+    if (!isLast) {
+      // Create new reduction variable.
+      NewPhi = PHINode::Create(Inst->getType(), 2, Name + ".acc",
+                               LoopBlock->begin());
+
+      Constant *Zero = NULL;
+      if (Inst->getType()->isFloatingPointTy())
+        Zero = ConstantFP::get(Inst->getType(), neutral);
+      else
+        Zero = ConstantInt::get(Inst->getType(), neutral, false);
+
+      NewPhi->addIncoming(Zero, LoopBlock->getPrevNode());
+      NewPhi->addIncoming(Inst, LoopBlock);
+    }
+    else
+      NewPhi = Phi; // Reuse existing reduction variable.
+
+    // Break dependencies using the new reduction variable.
+    bool found = false;
+    for (unsigned int i = 0; i < Inst->getNumOperands(); ++i) {
+      if (Inst->getOperand(i) == Previous) {
+        Inst->setOperand(i, NewPhi);
+        found = true;
+        break;
+      }
+    }
+
+    assert(found && "unexpected depedency chain");
+
+    if (!isLast) {
+      // Reduce all reduction variables at loop exit.
+      NewPhi = PHINode::Create(Inst->getType(), 1, Name + ".phi",
+                               LoopExit->begin());
+      NewPhi->addIncoming(Inst, LoopBlock);
+
+      NewRed = BinaryOperator::Create(Opcode, NewPhi, NewRed, Name + ".red",
+                                      InsertionPt);
+
+      // Remember the first operation as we will replace one of its uses (see
+      // below).
+      if (FirstRed == NULL)
+        FirstRed = NewRed;
+    }
+
+    Previous = Inst;
+  }
+
+  // Replace uses of the original reduction variable with uses of the new one.
+  Red->replaceAllUsesWith(NewRed);
+  FirstRed->setOperand(1, Red);
+  if (!renamed)
+    for (unsigned int i = 0; i < Phi->getNumIncomingValues(); ++i)
+      if (Phi->getIncomingValue(i) == NewRed) {
+        Phi->setIncomingValue(i, Red);
+        break;
+      }
+}


+/// OptimizeUnrolledReductions - Break dependencies between unrolled iterations
+/// of reductions in loops. This should be particularly effective for
+/// superscalar targets.
+void
+LSRInstance::OptimizeUnrolledReductions() {
+
+  std::vector<BasicBlock*>::const_iterator BBIter, BBEnd;
+
+  for (BBIter = L->getBlocks().begin(), BBEnd = L->getBlocks().end();
+       BBIter != BBEnd; ++BBIter) {
+    SmallVector<SmallVector<Instruction*, 8>*, 2> Chains;
+    SmallVector<PHINode*, 2> PhiNodes;
+
+    // Search for phi nodes.
+    for (BasicBlock::iterator I = (*BBIter)->begin(), E = (*BBIter)->end();
+         I != E; ++I) {
+      if (isa<PHINode>(I)) {
+        for (unsigned int i = 0; i < I->getNumOperands(); ++i) {
+          Value *Operand = I->getOperand(i);
+
+          // Search for binary operators used within phi nodes.
+          if (isa<BinaryOperator>(Operand)) {
+            BinaryOperator *BinaryOp = cast<BinaryOperator>(Operand);
+
+            // Must be in same basic block.
+            if (BinaryOp->getParent() == I->getParent()) {
+
+              // Look for interesting opcodes.
+              if (BinaryOp->getOpcode() == Instruction::FAdd ||
+                  BinaryOp->getOpcode() == Instruction::FSub ||
+                  BinaryOp->getOpcode() == Instruction::FMul ||
+                  BinaryOp->getOpcode() == Instruction::Add ||
+                  BinaryOp->getOpcode() == Instruction::Sub ||
+                  BinaryOp->getOpcode() == Instruction::Mul ||
+                  BinaryOp->getOpcode() == Instruction::And ||
+                  BinaryOp->getOpcode() == Instruction::Or ||
+                  BinaryOp->getOpcode() == Instruction::Xor) {
+                PHINode *Phi = cast<PHINode>(I);
+                SmallVector<Instruction*, 8> *Chain =
+                  FindUnrolledReduction(BinaryOp, Phi);
+
+                if (Chain && Chain->size() > 1) {
+                  // Found a dependency chain corresponding to a reduction,
+                  // record if length greater than one (phi node excluded).
+                  Chains.push_back(Chain);
+                  PhiNodes.push_back(Phi);
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+
+    // Break dependency chains once the basic block has been processed.
+    while (Chains.size() > 0 && PhiNodes.size() > 0) {
+      SmallVector<Instruction*, 8> *Chain = Chains.pop_back_val();
+      PHINode *Phi = PhiNodes.pop_back_val();
+      TransformUnrolledReduction(Chain, Phi);
+      delete(Chain);
+    }
+  }
+}

 /// reconcileNewOffset - Determine if the given use can accommodate a fixup
 /// at the given offset and other details. If so, update the use and
 /// return true.
 bool
 LSRInstance::reconcileNewOffset(LSRUse &LU, int64_t NewOffset, bool HasBaseReg,
                                 LSRUse::KindType Kind, Type *AccessTy) {
   int64_t NewMinOffset = LU.MinOffset;
   int64_t NewMaxOffset = LU.MaxOffset;
@@ 2,658 lines hidden @@ #endif // DEBUG

   DEBUG(dbgs() << "\nLSR on loop ";
         L->getHeader()->printAsOperand(dbgs(), /*PrintType=*/false);
         dbgs() << ":\n");

   // First, perform some low-level loop optimizations.
   OptimizeShadowIV();
   OptimizeLoopTermCond();
+  OptimizeUnrolledReductions();

   // If loop preparation eliminates all interesting IV users, bail.
   if (IU.empty()) return;

   // Skip nested loops until we can model them better with formulae.
   if (!L->empty()) {
     DEBUG(dbgs() << "LSR skipping outer loop " << *L << "\n");
     return;
@@ 191 lines hidden @@

test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll

 ; RUN: llc < %s -O3 -march=x86-64 -mcpu=core2 | FileCheck %s -check-prefix=X64
 ; RUN: llc < %s -O3 -march=x86 -mcpu=core2 | FileCheck %s -check-prefix=X32
 ; RUN: llc < %s -O3 -march=x86-64 -mcpu=core2 -addr-sink-using-gep=1 | FileCheck %s -check-prefix=X64
 ; RUN: llc < %s -O3 -march=x86 -mcpu=core2 -addr-sink-using-gep=1 | FileCheck %s -check-prefix=X32

 ; @simple is the most basic chain of address induction variables. Chaining
 ; saves at least one register and avoids complex addressing and setup
 ; code.
 ;
 ; X64: @simple
 ; %x * 4
 ; X64: shlq $2
 ; no other address computation in the preheader
 ; X64-NEXT: xorl
-; X64-NEXT: .align
+; X64: .align
 ; X64: %loop
 ; no complex address modes
 ; X64-NOT: (%{{[^)]+}},%{{[^)]+}},
 ;
 ; X32: @simple
 ; no expensive address computation in the preheader
 ; X32-NOT: imul
 ; X32: %loop
@@ 279 lines hidden @@

test/Transforms/LoopStrengthReduce/unrolled-reduction.ll

				; RUN: opt < %s -loop-reduce -S | FileCheck %s

				; CHECK-LABEL: loop:
				; CHECK: .acc = phi
				; CHECK-NEXT: .acc = phi
				; CHECK-NEXT: .acc = phi

				; CHECK: add
				; CHECK-NEXT: add
				; CHECK-NEXT: add
				; CHECK-NEXT: add

				; CHECK-LABEL: exit
				; CHECK: .phi = phi
				; CHECK-NEXT: .phi = phi
				; CHECK-NEXT: .phi = phi

				; CHECK: .red = add
				; CHECK-NEXT: .red = add
				; CHECK-NEXT: .red = add

				define i32 @add(i32* %a, i32* %b, i32 %x) nounwind {
				entry:
				br label %loop
				loop:
				%iv = phi i32* [ %a, %entry ], [ %iv4, %loop ]
				%s = phi i32 [ 0, %entry ], [ %s4, %loop ]
				%v = load i32* %iv
				%iv1 = getelementptr inbounds i32* %iv, i32 %x
				%v1 = load i32* %iv1
				%iv2 = getelementptr inbounds i32* %iv1, i32 %x
				%v2 = load i32* %iv2
				%iv3 = getelementptr inbounds i32* %iv2, i32 %x
				%v3 = load i32* %iv3
				%s1 = add i32 %s, %v
				%s2 = add i32 %s1, %v1
				%s3 = add i32 %s2, %v2
				%s4 = add i32 %s3, %v3
				%iv4 = getelementptr inbounds i32* %iv3, i32 %x
				%cmp = icmp eq i32* %iv4, %b
				br i1 %cmp, label %exit, label %loop
				exit:
				ret i32 %s4
				}

				; CHECK-LABEL: loop:
				; CHECK: .acc = phi
				; CHECK-NEXT: .acc = phi
				; CHECK-NEXT: .acc = phi

				; CHECK: sub
				; CHECK-NEXT: sub
				; CHECK-NEXT: sub
				; CHECK-NEXT: sub

				; CHECK-LABEL: exit
				; CHECK: .phi = phi
				; CHECK-NEXT: .phi = phi
				; CHECK-NEXT: .phi = phi

				; CHECK: .red = add
				; CHECK-NEXT: .red = add
				; CHECK-NEXT: .red = add

				define i32 @sub(i32* %a, i32* %b, i32 %x) nounwind {
				entry:
				br label %loop
				loop:
				%iv = phi i32* [ %a, %entry ], [ %iv4, %loop ]
				%s = phi i32 [ 0, %entry ], [ %s4, %loop ]
				%v = load i32* %iv
				%iv1 = getelementptr inbounds i32* %iv, i32 %x
				%v1 = load i32* %iv1
				%iv2 = getelementptr inbounds i32* %iv1, i32 %x
				%v2 = load i32* %iv2
				%iv3 = getelementptr inbounds i32* %iv2, i32 %x
				%v3 = load i32* %iv3
				%s1 = sub i32 %s, %v
				%s2 = sub i32 %s1, %v1
				%s3 = sub i32 %s2, %v2
				%s4 = sub i32 %s3, %v3
				%iv4 = getelementptr inbounds i32* %iv3, i32 %x
				%cmp = icmp eq i32* %iv4, %b
				br i1 %cmp, label %exit, label %loop
				exit:
				ret i32 %s4
				}

				; CHECK-LABEL: loop:
				; CHECK: .acc = phi
				; CHECK-NEXT: .acc = phi
				; CHECK-NEXT: .acc = phi

				; CHECK: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd
				; CHECK-NEXT: fadd

				; CHECK-LABEL: exit
				; CHECK: .phi = phi
				; CHECK-NEXT: .phi = phi
				; CHECK-NEXT: .phi = phi

				; CHECK: .red = fadd
				; CHECK-NEXT: .red = fadd
				; CHECK-NEXT: .red = fadd

				define float @fadd(float* %a, float* %b, i32 %x) nounwind {
				entry:
				br label %loop
				loop:
				%iv = phi float* [ %a, %entry ], [ %iv4, %loop ]
				%s = phi float [ 0.0, %entry ], [ %s4, %loop ]
				%v = load float* %iv
				%iv1 = getelementptr inbounds float* %iv, i32 %x
				%v1 = load float* %iv1
				%iv2 = getelementptr inbounds float* %iv1, i32 %x
				%v2 = load float* %iv2
				%iv3 = getelementptr inbounds float* %iv2, i32 %x
				%v3 = load float* %iv3
				%s1 = fadd fast float %s, %v
				%s2 = fadd fast float %s1, %v1
				%s3 = fadd fast float %s2, %v2
				%s4 = fadd fast float %s3, %v3
				%iv4 = getelementptr inbounds float* %iv3, i32 %x
				%cmp = icmp eq float* %iv4, %b
				br i1 %cmp, label %exit, label %loop
				exit:
				ret float %s4
				}

				; CHECK-LABEL: loop:
				; CHECK: .acc = phi
				; CHECK-NEXT: .acc = phi
				; CHECK-NEXT: .acc = phi

				; CHECK: fsub
				; CHECK-NEXT: fsub
				; CHECK-NEXT: fsub
				; CHECK-NEXT: fsub

				; CHECK-LABEL: exit
				; CHECK: .phi = phi
				; CHECK-NEXT: .phi = phi
				; CHECK-NEXT: .phi = phi

				; CHECK: .red = fadd
				; CHECK-NEXT: .red = fadd
				; CHECK-NEXT: .red = fadd

				define float @fsub(float* %a, float* %b, i32 %x) nounwind {
				entry:
				br label %loop
				loop:
				%iv = phi float* [ %a, %entry ], [ %iv4, %loop ]
				%s = phi float [ 0.0, %entry ], [ %s4, %loop ]
				%v = load float* %iv
				%iv1 = getelementptr inbounds float* %iv, i32 %x
				%v1 = load float* %iv1
				%iv2 = getelementptr inbounds float* %iv1, i32 %x
				%v2 = load float* %iv2
				%iv3 = getelementptr inbounds float* %iv2, i32 %x
				%v3 = load float* %iv3
				%s1 = fsub fast float %s, %v
				%s2 = fsub fast float %s1, %v1
				%s3 = fsub fast float %s2, %v2
				%s4 = fsub fast float %s3, %v3
				%iv4 = getelementptr inbounds float* %iv3, i32 %x
				%cmp = icmp eq float* %iv4, %b
				br i1 %cmp, label %exit, label %loop
				exit:
				ret float %s4
				}
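
The `sub` and `fsub` tests above expect the partial results to be combined with `add`/`fadd` at the loop exit. The sketch below models why, using plain integers rather than LLVM IR. It is only an illustration with hypothetical names (`Op`, `combineOp`, `apply`, `identity`, `reduceSerial`, `reduceSplit`): each extra accumulator starts at the identity of the combining operation, and the exit code folds the accumulators together. It assumes 32-bit unsigned arithmetic so that wraparound is well defined. Note that the identity of bitwise AND is the all-ones value, which is what this sketch uses, whereas the patch sets `neutral = 1` for `And`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op { Add, Sub, Mul, And, Or, Xor };

// Operation used at the loop exit to fold accumulators together.
// Subtraction chains are folded with addition.
constexpr Op combineOp(Op op) { return op == Op::Sub ? Op::Add : op; }

std::uint32_t apply(Op op, std::uint32_t a, std::uint32_t b) {
  switch (op) {
  case Op::Add: return a + b;
  case Op::Sub: return a - b;
  case Op::Mul: return a * b;
  case Op::And: return a & b;
  case Op::Or:  return a | b;
  case Op::Xor: return a ^ b;
  }
  return 0; // not reached for valid Op values
}

// Identity element of the combining operation.
std::uint32_t identity(Op op) {
  switch (combineOp(op)) {
  case Op::Mul: return 1u;
  case Op::And: return ~0u;
  default:      return 0u; // Add, Or, Xor
  }
}

// Serial chain: r = (((init op v0) op v1) op v2) ...
std::uint32_t reduceSerial(Op op, std::uint32_t init,
                           const std::vector<std::uint32_t> &v) {
  std::uint32_t r = init;
  for (std::uint32_t x : v)
    r = apply(op, r, x);
  return r;
}

// Split chain: element i goes to accumulator i % lanes (lanes must be >= 1).
// Lane 0 keeps the original initial value; the others start at the identity.
std::uint32_t reduceSplit(Op op, std::uint32_t init,
                          const std::vector<std::uint32_t> &v,
                          std::size_t lanes) {
  std::vector<std::uint32_t> acc(lanes, identity(op));
  acc[0] = init;
  for (std::size_t i = 0; i < v.size(); ++i)
    acc[i % lanes] = apply(op, acc[i % lanes], v[i]);
  std::uint32_t r = acc[0];
  for (std::size_t k = 1; k < lanes; ++k)
    r = apply(combineOp(op), r, acc[k]);
  return r;
}
```

With `Op::Sub`, every extra lane accumulates `0 - v`, so adding the lanes at the end gives the same value as the serial subtraction chain.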
