This is an archive of the discontinued LLVM Phabricator instance.

Differential D34025

[SCEV] Teach SCEVExpander to expand BinPow
Closed, Public

Authored by mkazantsev on Jun 8 2017, 2:46 AM.

Details

Reviewers

sanjoy
reames
anna
apilipenko
dneilson
skatkov
qcolombet

Commits

rG35b2a18eb96d: [SCEV] Teach SCEVExpander to expand BinPow
rL305663: [SCEV] Teach SCEVExpander to expand BinPow

Summary

The current implementation of SCEVExpander behaves very naively when
it expands power calculations. For example, the SCEV for x^8 looks like

(x * x * x * x * x * x * x * x)

If we try to expand it, it generates a very straightforward sequence of muls, like:

x2 = mul x, x
x3 = mul x2, x
x4 = mul x3, x
    ...
x8 = mul x7, x

This is inefficient. A better way is to generate a sequence of
binary power calculations. In this case the expanded calculation will look like:

x2 = mul x, x
x4 = mul x2, x2
x8 = mul x4, x4

In some cases the code size reduction for such SCEVs is dramatic. If we had a loop:

x = a;
for (int i = 0; i < 3; i++)
  x = x * x;

and this loop has been fully unrolled, we have something like:

x = a;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;

The SCEV for x8 is the same as in the example above, and if we for some reason
want to expand it, we will naively generate 7 multiplications instead of 3.
The BinPow expansion algorithm keeps the code size reasonable here.

This patch teaches SCEVExpander to generate a sequence of BinPow multiplications
if a SCEVMulExpr has repeated operands.
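
A minimal standalone sketch of the idea, using wrapping 64-bit integers in place of SCEV operands and IR values. The names `PowResult`, `naivePow`, and `binPow` are made up for this illustration and are not part of LLVM; each multiplication stands in for one emitted `mul`, and the loop shape is modeled on the one in this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not LLVM code.
struct PowResult {
  uint64_t Value;   // x^n, computed with wrapping unsigned arithmetic
  unsigned NumMuls; // number of multiplications performed
};

// Straightforward expansion: x2 = x * x, x3 = x2 * x, ... (n - 1 muls).
PowResult naivePow(uint64_t X, uint64_t N) {
  assert(N > 0 && "zeroth power is not handled in this sketch");
  PowResult R{X, 0};
  for (uint64_t I = 1; I < N; ++I) {
    R.Value *= X;
    ++R.NumMuls;
  }
  return R;
}

// Binary power expansion: build x^1, x^2, x^4, ... by repeated squaring and
// multiply together the powers whose bit is set in N.
PowResult binPow(uint64_t X, uint64_t N) {
  assert(N > 0 && "zeroth power is not handled in this sketch");
  // Keep N below 2^63 so that BinExp <<= 1 cannot wrap around to zero.
  assert(N <= (UINT64_MAX >> 1) && "exponent too large for this sketch");
  uint64_t P = X;
  uint64_t Result = 0;
  bool HaveResult = false;
  unsigned NumMuls = 0;
  if (N & 1) {
    Result = P;
    HaveResult = true;
  }
  for (uint64_t BinExp = 2; BinExp <= N; BinExp <<= 1) {
    P *= P; // next power of two: x^BinExp
    ++NumMuls;
    if (N & BinExp) {
      if (HaveResult) {
        Result *= P;
        ++NumMuls;
      } else {
        Result = P;
        HaveResult = true;
      }
    }
  }
  return {Result, NumMuls};
}
```

For x^8 this sketch performs 3 multiplications where the naive version performs 7, matching the numbers in the summary.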

Diff Detail

Repository: rL LLVM

Event Timeline

mkazantsev created this revision. Jun 8 2017, 2:46 AM

Herald added a subscriber: mzolotukhin. Jun 8 2017, 2:46 AM

mkazantsev edited the summary of this revision. Jun 8 2017, 2:47 AM

mkazantsev added a reviewer: qcolombet. Jun 8 2017, 3:23 AM

Have you looked at what happens with addition? i.e.

x = a;
for (i = 0; i < N; i++) x = x + x;

Should still be an exponential number of additions, when it could be just (x * (1 << N)).

lib/Analysis/ScalarEvolutionExpander.cpp
766 ↗	(On Diff #101874)	How challenging would it be to enhance SCEVMulExpr to represent repeated multiplies by the same value as a pair (value, exponent) rather than as a list of the same value being repeated 'exponent' times? This patch has addressed not having to expand a ^ (2 ^ N) at runtime, but we still have an exponential number of terms in the OpsAndLoops list, which should quite negatively affect compile time.
773 ↗	(On Diff #101874)	If Ty is a float type and N isn't a trivial value (like 1 or 2) then it can be better (faster & more accurate) to generate a call to @llvm.powi rather than the tree of multiplies.

Daniel, regarding getAddExpr: SCEV tends to keep its expressions as flat as possible, so if it has two (or more) equal args in a SCEVAddExpr, it transforms them into a multiplication. You can try to invoke getAddExpr for (a, a, a) and it will return something like (3 * a). So transforming Adds to Muls is not a problem; the problem with Muls is that we don't have a Power construct in SCEV, which would be exactly what we need to represent such expressions. But as I said in the comment, we don't have it and it wouldn't be easy to introduce it.
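
The arithmetic behind the addition case can be checked with a small standalone sketch. It uses wrapping 32-bit integers rather than SCEV; `doubleNTimes` and `closedForm` are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>

// Hypothetical illustration with wrapping 32-bit arithmetic (not SCEV API).
// Doubles x N times, as in `for (i = 0; i < N; i++) x = x + x;`.
uint32_t doubleNTimes(uint32_t X, unsigned N) {
  for (unsigned I = 0; I < N; ++I)
    X = X + X;
  return X;
}

// Closed form of the same value: x * 2^N. Valid for N < 32 in this sketch,
// because shifting a 32-bit value by 32 or more is undefined.
uint32_t closedForm(uint32_t X, unsigned N) {
  return X * (uint32_t{1} << N);
}
```

Both functions agree for every N below 32, which is why a sum of equal operands can be represented as a single multiplication by a constant.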

lib/Analysis/ScalarEvolutionExpander.cpp
766 ↗	(On Diff #101874)	I agree with your concern. In fact what you propose is to introduce a new construct like SCEVPower. It would be extremely nice to have, but unfortunately it would be a really huge change affecting all SCEV.
773 ↗	(On Diff #101874)	SCEV doesn't work with FP types, so it is not our case. :) But if it did, such transformations changing the order of calculations could not even be legal at all.
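
As a rough illustration of the (value, exponent) representation discussed in the inline comments, the sketch below collapses a sorted operand list into pairs. Strings stand in for SCEV operands and `toPowers` is a hypothetical helper, not an existing LLVM function. Note that it only compresses a list that has already been built, so by itself it does not address the concern about SCEVs with an exponential number of operands.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical: collapse a sorted operand list such as {a, a, a, b} into
// (operand, exponent) pairs {(a, 3), (b, 1)}.
std::vector<std::pair<std::string, uint64_t>>
toPowers(const std::vector<std::string> &SortedOps) {
  std::vector<std::pair<std::string, uint64_t>> Powers;
  for (const std::string &Op : SortedOps) {
    // Equal operands are adjacent because the input is sorted, so it is
    // enough to compare with the most recently added pair.
    if (!Powers.empty() && Powers.back().first == Op)
      ++Powers.back().second;
    else
      Powers.emplace_back(Op, 1);
  }
  return Powers;
}
```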

Hi Maxim,

If I understand your motivation here correctly, I think we've already lost by creating a SCEVMulExpr with an exponentially large number of operands. IMO the right fix would be to prevent creating such expressions in the first place.

This revision now requires changes to proceed. Jun 9 2017, 2:36 PM

Sanjoy, what is the difference between having one SCEV with exponential number of operands or having a SCEV tree with 2 operands in each node, but the same number of operands in total? We won't win anything doing so.

In D34025#777416, @mkazantsev wrote:

Sanjoy, what is the difference between having one SCEV with exponential number of operands or having a SCEV tree with 2 operands in each node, but the same number of operands in total? We won't win anything doing so.

I don't think you'll need the same number of operands in total in the second case. IIUC, the problematic IR is:

x1 = x0 * x0;
x2 = x1 * x1;
x3 = x2 * x2;
x4 = x3 * x3;

where the SCEV expression for x4 ends up being (x0 * x0 * ... 16 times). This, in turn, gets lowered into:

xa = x0 * x0;
xb = xa * x0;
... 15 instructions

I'm postulating that the real problem was that we created a SCEV for x4 with 16 operands. If we instead had the SCEV for x4 be:

// Writing it this way since printing SCEVs obscures SCEV identity
S0 = x0 * x0;
S1 = S0 * S0;
S2 = S1 * S1;
S3 = S2 * S2;

then the expansion would naturally have been

xa = x0 * x0;  // expand(S0)
xb = xa * xa;  // expand(S1)
xc = xb * xb;  // expand(S2)
xd = xc * xc;  // expand(S3)

(Note that expand tries to re-use already expanded expressions, see InsertedExpressions).
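
A toy model of this argument, assuming each SCEV node has exactly two operands and that expansion remembers what it has already emitted. `Node`, `ToyExpander`, and `countMulsForSquaringChain` are hypothetical names for this sketch; the cache only loosely mimics the reuse mentioned above and is not the real SCEVExpander logic.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a SCEV multiply node with exactly two operands.
// A node with null operands is the leaf x0.
struct Node {
  const Node *LHS = nullptr;
  const Node *RHS = nullptr;
};

// Toy expander: emits one "mul" per distinct non-leaf node and remembers
// which nodes were already expanded.
class ToyExpander {
  std::unordered_map<const Node *, int> Expanded; // node -> emitted value id
  int NextId = 0;

public:
  int NumMuls = 0;

  int expand(const Node *N) {
    auto It = Expanded.find(N);
    if (It != Expanded.end())
      return It->second; // reuse the value emitted earlier
    if (N->LHS) {
      expand(N->LHS);
      expand(N->RHS);
      ++NumMuls;
    }
    int Id = NextId++;
    Expanded.emplace(N, Id);
    return Id;
  }
};

// Builds S0 = x0 * x0, S1 = S0 * S0, ... and returns the number of muls
// emitted when the last node of the chain is expanded.
int countMulsForSquaringChain(int Depth) {
  std::vector<Node> Nodes(Depth + 1); // Nodes[0] is the leaf x0
  for (int I = 1; I <= Depth; ++I)
    Nodes[I].LHS = Nodes[I].RHS = &Nodes[I - 1];
  ToyExpander E;
  E.expand(&Nodes[Depth]);
  return E.NumMuls;
}
```

With a chain of depth 4 (S0 through S3) the toy expander emits 4 multiplications, compared with the 15 needed for a flat 16-operand product.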

Two caveats though:

Obviously, we don't want to prevent SCEV from merging multiplication-of-multiplication expressions completely -- some of that is required for good simplification. We just want to do that up to a sane limit.
This patch may still be a good idea, but for that you'll have to make a case by claiming that the new expanded form is a more canonical expansion of SCEVMulExprs, not by claiming it is a bugfix. You'll also need to show that other passes don't get pessimized by such multiplication trees (unlikely).

This patch may still be a good idea, but for that you'll have to make a case by claiming that the new expanded form is a more canonical expansion of SCEVMulExprs, not by claiming it is a bugfix. You'll also need to show that other passes don't get pessimized by such multiplication trees (unlikely).

Hi Sanjoy,

Actually, an example of better behavior is already present in test_03. If you look at it, you can notice that the new expander logic has changed the order of calculations. Maybe this example is not good enough because the overall number of multiplications remains the same, but if we have something like

x0 = a * a
x1 = x0 * a
x2 = x1 * a
...
xn = x(n-1) * a

then the new logic reduces the overall number of multiplications from O(N) to O(log(N)). Would that be sufficient justification of this change as an improvement? I can add a test that does it.

As for limiting the operand merging to some limit - I am still not convinced that it is a good idea. What if we have the situation I described above, with n being like 1022? In fact, we have a chain of muls that calculates a ^ 1024. It could come from a loop like:

x = a * a;
for (i = 0; i < 1022; i++)
  x = x * a;

which got fully unrolled. In this case it is profitable to merge the operands into one SCEV, figure out that we actually need a power of 2, and then expand it into BinPow, decreasing the number of resulting muls from 1023 to ~10. If we set a merge limit, we can fail to do that.

Anyway, setting a merge limit for muls would be a separate patch. If you still think we need it, I will make it. As for this one, if I add a test for what I described (where it reduces the overall number of muls), would that be enough to approve it?

Thanks!
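
The multiplication counts quoted in this comment can be reproduced with a short sketch. `numMulsBinPow` is a hypothetical helper, not LLVM code; it assumes the repeated-squaring scheme, which needs floor(log2(N)) squarings plus one extra multiplication for every set bit of N beyond the first.

```cpp
#include <cstdint>

// Hypothetical helper: number of multiplications needed to compute x^N by
// repeated squaring. Requires N > 0.
unsigned numMulsBinPow(uint64_t N) {
  unsigned Squarings = 0;
  unsigned SetBits = 0;
  for (uint64_t V = N; V != 0; V >>= 1) {
    SetBits += static_cast<unsigned>(V & 1);
    if (V > 1)
      ++Squarings; // one squaring per bit position above the lowest
  }
  return Squarings + SetBits - 1;
}
```

For N = 1024 this gives 10, against 1023 multiplications for the linear chain, and for N = 27 it gives 7.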

Re-justified this as being an improvement, added a test where a linear calculation is replaced with a logarithmic one. I will make a separate patch to avoid creation of SCEVs with unreasonably huge numbers of operands.

Ping

lgtm with nits

lib/Analysis/ScalarEvolutionExpander.cpp
767 ↗	(On Diff #102303)	Avoid braces here. You could also write this loop using `find_if`. Finally I'd perhaps suggest using a `uint64_t` for `Exponent` so that an overflow is practically impossible.
775 ↗	(On Diff #102303)	Can you get rid of this conditional and instead start `BinExp` from `1`?
792 ↗	(On Diff #102303)	It is a bit hard to read `ExpandOpBinPowN()` and tell what all local state the call touches. Do you mind extracting out a static helper function so that the data flow is more obvious?

This revision is now accepted and ready to land. Jun 16 2017, 9:04 PM

mkazantsev added inline comments. Jun 18 2017, 8:11 PM

lib/Analysis/ScalarEvolutionExpander.cpp
767 ↗	(On Diff #102303)	I cannot get rid of braces because of two operands inside. Will split them into two lines. Isn't it true that the operands here are sorted? If yes, we just need a consecutive sequence of similar operands, find_if here would be an overcomplication. Agreed with uint64_t.

mkazantsev added inline comments. Jun 18 2017, 8:17 PM

lib/Analysis/ScalarEvolutionExpander.cpp
775 ↗	(On Diff #102303)	That would make the loop a bit messy, since we are generating the current power differently for p = 1 and the rest, and because we need to generate the next operand before (possibly) using it. I will need an if in the loop in this case, something like

for (unsigned BinExp = 1; BinExp <= Exponent; BinExp <<= 1) {
  P = BinExp == 1 ? expandCodeFor(I->second, Ty) : InsertBinop(Instruction::Mul, P, P);
  if (Exponent & BinExp)
    Result = Result ? InsertBinop(Instruction::Mul, Result, P) : P;
}

It is hardly easier to read and worse performance-wise.

mkazantsev added inline comments. Jun 18 2017, 8:26 PM

lib/Analysis/ScalarEvolutionExpander.cpp
792 ↗	(On Diff #102303)	We call internal SCEVExpander methods (like InsertBinop) inside. Do we really want a static helper that takes "this" as an argument? I think the better way to make the data flow obvious here is to capture the required objects explicitly.

mkazantsev updated this revision to Diff 102987. Jun 18 2017, 10:29 PM

mkazantsev marked 3 inline comments as done.

mkazantsev added inline comments.

lib/Analysis/ScalarEvolutionExpander.cpp
767 ↗	(On Diff #102303)	Added a test that checks that the BinPow is expanded even if the calculation of SCEV included different operands.

Closed by commit rL305663: [SCEV] Teach SCEVExpander to expand BinPow (authored by mkazantsev). Jun 18 2017, 11:25 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                              Size
llvm/trunk/lib/Analysis/ScalarEvolutionExpander.cpp               48 lines
llvm/trunk/test/Transforms/LoopStrengthReduce/X86/bin_power.ll    264 lines

Diff 102989

llvm/trunk/lib/Analysis/ScalarEvolutionExpander.cpp

[... 742 lines not shown ...]
   for (std::reverse_iterator<SCEVMulExpr::op_iterator> I(S->op_end()),
   [...]
     OpsAndLoops.push_back(std::make_pair(getRelevantLoop(*I), *I));

   // Sort by loop. Use a stable sort so that constants follow non-constants.
   std::stable_sort(OpsAndLoops.begin(), OpsAndLoops.end(), LoopCompare(SE.DT));

   // Emit instructions to mul all the operands. Hoist as much as possible
   // out of loops.
   Value *Prod = nullptr;
-  for (const auto &I : OpsAndLoops) {
-    const SCEV *Op = I.second;
+  auto I = OpsAndLoops.begin();
+
+  // Expand the calculation of X pow N in the following manner:
+  // Let N = P1 + P2 + ... + PK, where all P are powers of 2. Then:
+  // X pow N = (X pow P1) * (X pow P2) * ... * (X pow PK).
+  const auto ExpandOpBinPowN = [this, &I, &OpsAndLoops, &Ty]() {
+    auto E = I;
+    // Calculate how many times the same operand from the same loop is included
+    // into this power.
+    uint64_t Exponent = 0;
+    const uint64_t MaxExponent = UINT64_MAX >> 1;
+    // No one sane will ever try to calculate such huge exponents, but if we
+    // need this, we stop on UINT64_MAX / 2 because we need to exit the loop
+    // below when the power of 2 exceeds our Exponent, and we want it to be
+    // 1u << 31 at most to not deal with unsigned overflow.
+    while (E != OpsAndLoops.end() && *I == *E && Exponent != MaxExponent) {
+      ++Exponent;
+      ++E;
+    }
+    assert(Exponent > 0 && "Trying to calculate a zeroth exponent of operand?");
+
+    // Calculate powers with exponents 1, 2, 4, 8 etc. and include those of them
+    // that are needed into the result.
+    Value *P = expandCodeFor(I->second, Ty);
+    Value *Result = nullptr;
+    if (Exponent & 1)
+      Result = P;
+    for (uint64_t BinExp = 2; BinExp <= Exponent; BinExp <<= 1) {
+      P = InsertBinop(Instruction::Mul, P, P);
+      if (Exponent & BinExp)
+        Result = Result ? InsertBinop(Instruction::Mul, Result, P) : P;
+    }
+
+    I = E;
+    assert(Result && "Nothing was expanded?");
+    return Result;
+  };
+
+  while (I != OpsAndLoops.end()) {
     if (!Prod) {
       // This is the first operand. Just expand it.
-      Prod = expand(Op);
+      Prod = ExpandOpBinPowN();
-    } else if (Op->isAllOnesValue()) {
+    } else if (I->second->isAllOnesValue()) {
       // Instead of doing a multiply by negative one, just do a negate.
       Prod = InsertNoopCastOfTo(Prod, Ty);
       Prod = InsertBinop(Instruction::Sub, Constant::getNullValue(Ty), Prod);
+      ++I;
     } else {
       // A simple mul.
-      Value *W = expandCodeFor(Op, Ty);
+      Value *W = ExpandOpBinPowN();
       Prod = InsertNoopCastOfTo(Prod, Ty);
       // Canonicalize a constant to the RHS.
       if (isa<Constant>(Prod)) std::swap(Prod, W);
       const APInt *RHS;
       if (match(W, m_Power2(RHS))) {
         // Canonicalize Prod*(1<<C) to Prod<<C.
         assert(!Ty->isVectorTy() && "vector types are not SCEVable");
         Prod = InsertBinop(Instruction::Shl, Prod,
[... 1,488 lines not shown ...]

llvm/trunk/test/Transforms/LoopStrengthReduce/X86/bin_power.ll

				; RUN: opt < %s -loop-reduce -S | FileCheck %s

				target datalayout = "e-m:e-i32:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Show that the b^2 is expanded correctly.
				define i32 @test_01(i32 %a) {
				; CHECK-LABEL: @test_01
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK-NEXT: [[B:[^ ]+]] = add i32 %a, 1
				; CHECK-NEXT: [[B2:[^ ]+]] = mul i32 [[B]], [[B]]
				; CHECK-NEXT: [[R1:[^ ]+]] = add i32 [[B2]], -1
				; CHECK-NEXT: [[R2:[^ ]+]] = sub i32 [[R1]], [[IV_INC]]
				; CHECK-NEXT: ret i32 [[R2]]

				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%b = add i32 %a, 1
				%b.pow.2 = mul i32 %b, %b
				%result = add i32 %b.pow.2, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %result
				}

				; Show that b^8 is expanded correctly.
				define i32 @test_02(i32 %a) {
				; CHECK-LABEL: @test_02
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK-NEXT: [[B:[^ ]+]] = add i32 %a, 1
				; CHECK-NEXT: [[B2:[^ ]+]] = mul i32 [[B]], [[B]]
				; CHECK-NEXT: [[B4:[^ ]+]] = mul i32 [[B2]], [[B2]]
				; CHECK-NEXT: [[B8:[^ ]+]] = mul i32 [[B4]], [[B4]]
				; CHECK-NEXT: [[R1:[^ ]+]] = add i32 [[B8]], -1
				; CHECK-NEXT: [[R2:[^ ]+]] = sub i32 [[R1]], [[IV_INC]]
				; CHECK-NEXT: ret i32 [[R2]]
				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%b = add i32 %a, 1
				%b.pow.2 = mul i32 %b, %b
				%b.pow.4 = mul i32 %b.pow.2, %b.pow.2
				%b.pow.8 = mul i32 %b.pow.4, %b.pow.4
				%result = add i32 %b.pow.8, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %result
				}

				; Show that b^27 (27 = 1 + 2 + 8 + 16) is expanded correctly.
				define i32 @test_03(i32 %a) {
				; CHECK-LABEL: @test_03
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK-NEXT: [[B:[^ ]+]] = add i32 %a, 1
				; CHECK-NEXT: [[B2:[^ ]+]] = mul i32 [[B]], [[B]]
				; CHECK-NEXT: [[B3:[^ ]+]] = mul i32 [[B]], [[B2]]
				; CHECK-NEXT: [[B4:[^ ]+]] = mul i32 [[B2]], [[B2]]
				; CHECK-NEXT: [[B8:[^ ]+]] = mul i32 [[B4]], [[B4]]
				; CHECK-NEXT: [[B11:[^ ]+]] = mul i32 [[B3]], [[B8]]
				; CHECK-NEXT: [[B16:[^ ]+]] = mul i32 [[B8]], [[B8]]
				; CHECK-NEXT: [[B27:[^ ]+]] = mul i32 [[B11]], [[B16]]
				; CHECK-NEXT: [[R1:[^ ]+]] = add i32 [[B27]], -1
				; CHECK-NEXT: [[R2:[^ ]+]] = sub i32 [[R1]], [[IV_INC]]
				; CHECK-NEXT: ret i32 [[R2]]
				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%b = add i32 %a, 1
				%b.pow.2 = mul i32 %b, %b
				%b.pow.4 = mul i32 %b.pow.2, %b.pow.2
				%b.pow.8 = mul i32 %b.pow.4, %b.pow.4
				%b.pow.16 = mul i32 %b.pow.8, %b.pow.8
				%b.pow.24 = mul i32 %b.pow.16, %b.pow.8
				%b.pow.25 = mul i32 %b.pow.24, %b
				%b.pow.26 = mul i32 %b.pow.25, %b
				%b.pow.27 = mul i32 %b.pow.26, %b
				%result = add i32 %b.pow.27, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %result
				}

				; Show how linear calculation of b^16 is turned into logarithmic.
				define i32 @test_04(i32 %a) {
				; CHECK-LABEL: @test_04
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK-NEXT: [[B:[^ ]+]] = add i32 %a, 1
				; CHECK-NEXT: [[B2:[^ ]+]] = mul i32 [[B]], [[B]]
				; CHECK-NEXT: [[B4:[^ ]+]] = mul i32 [[B2]], [[B2]]
				; CHECK-NEXT: [[B8:[^ ]+]] = mul i32 [[B4]], [[B4]]
				; CHECK-NEXT: [[B16:[^ ]+]] = mul i32 [[B8]], [[B8]]
				; CHECK-NEXT: [[R1:[^ ]+]] = add i32 [[B16]], -1
				; CHECK-NEXT: [[R2:[^ ]+]] = sub i32 [[R1]], [[IV_INC]]
				; CHECK-NEXT: ret i32 [[R2]]
				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%b = add i32 %a, 1
				%b.pow.2 = mul i32 %b, %b
				%b.pow.3 = mul i32 %b.pow.2, %b
				%b.pow.4 = mul i32 %b.pow.3, %b
				%b.pow.5 = mul i32 %b.pow.4, %b
				%b.pow.6 = mul i32 %b.pow.5, %b
				%b.pow.7 = mul i32 %b.pow.6, %b
				%b.pow.8 = mul i32 %b.pow.7, %b
				%b.pow.9 = mul i32 %b.pow.8, %b
				%b.pow.10 = mul i32 %b.pow.9, %b
				%b.pow.11 = mul i32 %b.pow.10, %b
				%b.pow.12 = mul i32 %b.pow.11, %b
				%b.pow.13 = mul i32 %b.pow.12, %b
				%b.pow.14 = mul i32 %b.pow.13, %b
				%b.pow.15 = mul i32 %b.pow.14, %b
				%b.pow.16 = mul i32 %b.pow.15, %b
				%result = add i32 %b.pow.16, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %result
				}

				; The output here is reasonably big, we just check that the amount of expanded
				; instructions is sane.
				define i32 @test_05(i32 %a) {
				; CHECK-LABEL: @test_05
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK: %100
				; CHECK-NOT: %150

				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%tmp3 = add i32 %a, 1
				%tmp4 = mul i32 %tmp3, %tmp3
				%tmp5 = mul i32 %tmp4, %tmp4
				%tmp6 = mul i32 %tmp5, %tmp5
				%tmp7 = mul i32 %tmp6, %tmp6
				%tmp8 = mul i32 %tmp7, %tmp7
				%tmp9 = mul i32 %tmp8, %tmp8
				%tmp10 = mul i32 %tmp9, %tmp9
				%tmp11 = mul i32 %tmp10, %tmp10
				%tmp12 = mul i32 %tmp11, %tmp11
				%tmp13 = mul i32 %tmp12, %tmp12
				%tmp14 = mul i32 %tmp13, %tmp13
				%tmp15 = mul i32 %tmp14, %tmp14
				%tmp16 = mul i32 %tmp15, %tmp15
				%tmp17 = mul i32 %tmp16, %tmp16
				%tmp18 = mul i32 %tmp17, %tmp17
				%tmp19 = mul i32 %tmp18, %tmp18
				%tmp20 = mul i32 %tmp19, %tmp19
				%tmp22 = add i32 %tmp20, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %tmp22
				}

				; Show that the transformation works even if the calculation involves different
				; values inside.
				define i32 @test_06(i32 %a, i32 %c) {
				; CHECK-LABEL: @test_06
				; CHECK: entry:
				; CHECK-NEXT: br label %loop
				; CHECK: loop:
				; CHECK-NEXT: [[IV:[^ ]+]] = phi i32 [ [[IV_INC:[^ ]+]], %loop ], [ 0, %entry ]
				; CHECK-NEXT: [[IV_INC]] = add nsw i32 [[IV]], -1
				; CHECK-NEXT: [[EXITCOND:[^ ]+]] = icmp eq i32 [[IV_INC]], -80
				; CHECK-NEXT: br i1 [[EXITCOND]], label %exit, label %loop
				; CHECK: exit:
				; CHECK: [[B:[^ ]+]] = add i32 %a, 1
				; CHECK-NEXT: [[B2:[^ ]+]] = mul i32 [[B]], [[B]]
				; CHECK-NEXT: [[B4:[^ ]+]] = mul i32 [[B2]], [[B2]]
				; CHECK-NEXT: [[B8:[^ ]+]] = mul i32 [[B4]], [[B4]]
				; CHECK-NEXT: [[B16:[^ ]+]] = mul i32 [[B8]], [[B8]]
				entry:
				br label %loop

				loop: ; preds = %loop, %entry
				%indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %loop ]
				%b = add i32 %a, 1
				%b.pow.2.tmp = mul i32 %b, %b
				%b.pow.2 = mul i32 %b.pow.2.tmp, %c
				%b.pow.3 = mul i32 %b.pow.2, %b
				%b.pow.4 = mul i32 %b.pow.3, %b
				%b.pow.5 = mul i32 %b.pow.4, %b
				%b.pow.6.tmp = mul i32 %b.pow.5, %b
				%b.pow.6 = mul i32 %b.pow.6.tmp, %c
				%b.pow.7 = mul i32 %b.pow.6, %b
				%b.pow.8 = mul i32 %b.pow.7, %b
				%b.pow.9 = mul i32 %b.pow.8, %b
				%b.pow.10 = mul i32 %b.pow.9, %b
				%b.pow.11 = mul i32 %b.pow.10, %b
				%b.pow.12.tmp = mul i32 %b.pow.11, %b
				%b.pow.12 = mul i32 %c, %b.pow.12.tmp
				%b.pow.13 = mul i32 %b.pow.12, %b
				%b.pow.14 = mul i32 %b.pow.13, %b
				%b.pow.15 = mul i32 %b.pow.14, %b
				%b.pow.16 = mul i32 %b.pow.15, %b
				%result = add i32 %b.pow.16, %indvars.iv
				%indvars.iv.next = add nuw nsw i32 %indvars.iv, 1
				%exitcond = icmp eq i32 %indvars.iv.next, 80
				br i1 %exitcond, label %exit, label %loop

				exit: ; preds = %loop
				ret i32 %result
				}