This is an archive of the discontinued LLVM Phabricator instance.


Differential D49151

[SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand.
ClosedPublic

Authored by az on Jul 10 2018, 12:55 PM.


Details

Reviewers

sanjoy
efriedma
javed.absar
sebpop

Commits

rGc30dfb2dfc3d: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Laod…
rL341726: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Laod…

Summary

One of the transformations performed by SimplifyIndVar is to widen instructions, eliminate sign/zero extend instructions, and reduce the generation of truncate instructions when legal. Consider the following common C code fragment within a loop:
p = *(base + x*i); // i is the loop induction variable

If x is a load instruction that LICM hoists outside the loop, then SimplifyIndVar generates optimal code by not emitting any truncate instructions after widening. If x is not hoisted, the generated code is sub-optimal, mainly because SimplifyIndVar relies on scalar evolution, which cannot handle loop-variant expressions. This patch handles the non-hoisted case and generates code similar to the hoisted case (see the output of the .ll file with and without the patch).

The performance effect of redundant truncate and extend instructions can be big on strength reduction and on the backend when choosing the appropriate addressing mode.

No performance change on SPEC for ARM A72, but a significant improvement on a proprietary benchmark. Note that an alternative to this patch is to move the SimplifyIndVar pass after PRE, because PRE hoists more loop-invariant code than LICM, given that it uses a less conservative but more expensive version of alias analysis. We opted not to change the pass ordering, which can affect performance more than a local change.
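As a rough illustration of the legality condition behind the widening described in the summary, the sketch below models a 32-bit induction variable and a loop-variant 32-bit operand with plain integers. The function names `narrowThenExtend` and `widenOperands` are hypothetical and are not part of the patch. The only point is that, when the 32-bit multiply does not overflow (the nsw assumption), sign-extending the product gives the same value as multiplying the sign-extended operands, which is what makes it legal to drop the extend after the multiply.

```cpp
#include <cassert>
#include <cstdint>

// Models the non-widened form: a 32-bit multiply followed by a sign extend.
// Precondition (the "nsw" assumption): i * x must fit in 32 bits.
std::int64_t narrowThenExtend(std::int32_t i, std::int32_t x) {
  std::int32_t mul = i * x;              // i32 multiply, assumed not to wrap
  return static_cast<std::int64_t>(mul); // sign extend i32 -> i64
}

// Models the widened form: extend both operands, then do one 64-bit multiply.
std::int64_t widenOperands(std::int32_t i, std::int32_t x) {
  return static_cast<std::int64_t>(i) * static_cast<std::int64_t>(x);
}
```

Under the stated precondition the two functions return the same value for every input pair. Once the narrow product would overflow, only the widened form is well defined, which is why the sketch never calls `narrowThenExtend` on such inputs.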

Event Timeline

az created this revision. Jul 10 2018, 12:55 PM

Herald added a reviewer: javed.absar. Jul 10 2018, 12:55 PM

Herald added subscribers: hiraditya, kristof.beyls.

How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?

Note that an alternative to this patch is to move SimplifyIndVar pass after PRE because PRE hoists more loop invariant code than LICM given that it uses a less conservative but expensive version of alias analysis.

That might work in some cases, but not in general, so this is probably worth solving anyway...

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1379	Not sure the getExtendExpr helper is actually buying anything, given you have to check the extension kind anyway.
1431	It probably isn't a good idea to create a new multiply without erasing the old one.
llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll
310	Please fix the testcase so this load isn't dead (to make the testcase less fragile).

In D49151#1158061, @efriedma wrote:

How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?

For the unit test case added in this patch, the difference does not show up much in terms of performance. The test only shows how SimplifyIndVar, without the patch, generates different code for similar multiply instructions (%mul = ... and %mul1 = ...) that differ only in whether the first operand is hoisted outside the loop. For the hoisted case, it widens the multiply, inserts a sign extend outside the loop, and removes the sign extend instruction that comes after the mul. In the non-hoisted case, it adds a truncate instruction and leaves the sign extend after the multiply. This patch tries to make SimplifyIndVar generate the same code for both cases, especially since the code for the hoisted case seems more efficient and cleaner to work with for passes that run later. To see a performance improvement due to this patch, consider this C code:
struct info1 { int C; };
struct info2 { int data; };
void foo(struct info1* in, struct info2* out, int N, unsigned char* p) {
  int p0, p1, p2;
  for (int x = 1; x < N; ++x) {
    p0 = *(p + (x+1) * in->C);
    p1 = *(p + (x-1) * in->C);
    p2 = *(p + (x-2) * in->C);
    out[N + x].data = p0 - p1 + p2;
  }
  return;
}
Without the patch, here is the generated AArch64 assembly:

        ldr     w9, [x0]
        add     x11, x1, w2, sxtw #2
        mov     w12, w2
        mov     w8, wzr
        add     x11, x11, #4            // =4
        neg     w10, w9
        sub     x12, x12, #1            // =1
        lsl     w13, w9, #1
.LBB0_2:                                // %for.body
                                        // =>This Inner Loop Header: Depth=1
        add     w15, w13, w8
        ldrb    w14, [x3, w8, sxtw]
        ldrb    w15, [x3, w15, sxtw]
        add     w16, w10, w8
        ldrb    w16, [x3, w16, sxtw]
        add     w8, w8, w9
        subs    x12, x12, #1            // =1
        sub     w14, w15, w14
        add     w14, w14, w16
        str     w14, [x11], #4
        b.ne    .LBB0_2
.LBB0_3:                                // %for.cond.cleanup
        ret

With the patch, here is the generated AArch64 assembly:

        ldrsw   x8, [x0]
        add     x10, x1, w2, sxtw #2
        mov     w12, w2
        sub     x12, x12, #1            // =1
        add     x10, x10, #4            // =4
        neg     x9, x8
        lsl     x11, x8, #1
.LBB0_2:                                // %for.body
                                        // =>This Inner Loop Header: Depth=1
        ldrb    w13, [x3, x11]
        ldrb    w15, [x3]
        ldrb    w14, [x3, x9]
        add     x3, x3, x8
        sub     w13, w13, w15
        add     w13, w13, w14
        subs    x12, x12, #1            // =1
        str     w13, [x10], #4
        b.ne    .LBB0_2
.LBB0_3:                                // %for.cond.cleanup
        ret

There is a performance improvement with the patch because most of the values involved in computing the addresses of the ldrb instructions are computed outside the loop. The redundant truncate and sign extend instructions that reach loop strength reduction, and in particular induction variable rewriting, prevent that pass from generating the most efficient code.
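To make the difference between the two assembly listings easier to follow, here is a source-level sketch of the two loop shapes. `fooIndexed` mirrors the C example, recomputing each address from the 32-bit counter. `fooPointerBump` is a hand-written approximation of what the second listing does: the stride is loaded and widened once, the three offsets are loop-invariant, and the base pointer is advanced by the stride each iteration. The struct and function names, and the helper `variantsAgree`, are assumptions made for this illustration. Hoisting the load of `in->C` by hand is only valid because the sketch assumes `out` does not alias `in`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Info1 { int C; };
struct Info2 { int data; };

// Index form: each address is recomputed from the 32-bit loop counter.
void fooIndexed(const Info1 *in, Info2 *out, int N, const unsigned char *p) {
  for (int x = 1; x < N; ++x) {
    int p0 = *(p + (x + 1) * in->C);
    int p1 = *(p + (x - 1) * in->C);
    int p2 = *(p + (x - 2) * in->C);
    out[N + x].data = p0 - p1 + p2;
  }
}

// Pointer-bump form: 64-bit stride and offsets are computed before the loop.
void fooPointerBump(const Info1 *in, Info2 *out, int N, const unsigned char *p) {
  const std::int64_t stride = in->C;     // widened once, outside the loop
  const std::int64_t plus2 = 2 * stride; // offset for the (x+1) access at x == 1
  const std::int64_t minus1 = -stride;   // offset for the (x-2) access at x == 1
  Info2 *dst = out + N + 1;              // address of out[N + 1]
  for (int x = 1; x < N; ++x) {
    int p0 = p[plus2];
    int p1 = p[0];
    int p2 = p[minus1];
    p += stride;                         // advance the base by one stride
    dst->data = p0 - p1 + p2;
    ++dst;
  }
}

// Runs both variants on the same synthetic buffer; assumes C > 0 and N >= 1.
bool variantsAgree(int C, int N) {
  std::vector<unsigned char> buf(static_cast<std::size_t>(N + 2) * C + 1);
  for (std::size_t i = 0; i < buf.size(); ++i)
    buf[i] = static_cast<unsigned char>(i * 37 + 11);
  const unsigned char *p = buf.data() + C; // leaves room for the (x-2)*C access
  Info1 in{C};
  std::vector<Info2> a(2 * N), b(2 * N);
  fooIndexed(&in, a.data(), N, p);
  fooPointerBump(&in, b.data(), N, p);
  for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i].data != b[i].data)
      return false;
  return true;
}
```

Calling `variantsAgree(3, 8)` compares the two forms on a small input. The sketch says nothing about actual cycle counts; it only shows which computations leave the loop body.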

az added inline comments. Jul 13 2018, 4:03 PM

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1379	I took the code that does the non-scalar-evolution legality check (opcode = {add..}, call to hasNoSignWrap(), etc.) and put it, as is, in a function called getExtendExpr that both the existing code and the patch code call. It may be slightly redundant, but I can rewrite it if needed by removing the function and replicating the parts of the legality check that I need.
1431	Actually, this is cloning the use instruction and not removing it yet. The old instruction is removed by the defining instruction when we return up the call chain, if it has no other use. This is my understanding of how things work when widening the use. However, I am adding some code in the new revision to immediately remove the new instruction when it is not useful. I used to leave the instruction unused, hoping that it would be removed by dead code elimination.

evandro added a subscriber: evandro. Jul 20 2018, 12:09 PM

Ping

Your example doesn't really help make the case for this patch. The load in that test is actually loop-invariant; we just don't figure that out until after indvars transforms the induction variable. Probably LICM could be fixed to handle this case earlier. Then ultimately, the multiply goes away; the extra operation you're trying to get rid of is actually the sign-extension of a PHI node created by LSR. LSR and/or SCEV could probably be fixed so this produces an i64 PHI instead. Either of those fixes would be more straightforward and more obviously profitable.

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1379	Okay, it's probably fine as-is.
1431	We need to make sure to avoid the situation where after this transform runs, there's an i32 multiply and an i64 multiply. It's possible IV rewriting could lead to this sort of situation anyway in certain edge cases, I guess (haven't looked too closely), but multiplies are relatively expensive, and i64 multiplies are more expensive than i32 multiplies. Actually, more generally, we probably need to weigh the extra cost of the multiply; on multiple targets, an i64 multiply is substantially slower than an i32 multiply.

az updated this revision to Diff 157396. Jul 25 2018, 4:39 PM

In D49151#1174443, @efriedma wrote:

Your example doesn't really help make the case for this patch. The load in that test is actually loop-invariant; we just don't figure that out until after indvars transforms the induction variable. Probably LICM could be fixed to handle this case earlier. Then ultimately, the multiply goes away; the extra operation you're trying to get rid of is actually the sign-extension of a PHI node created by LSR. LSR and/or SCEV could probably be fixed so this produces an i64 PHI instead. Either of those fixes would be more straightforward and more obviously profitable.

My example is a greatly simplified version of the actual benchmark, but you are absolutely right that we should solve this problem in LICM. If we can do that, SimplifyIndVar would work with cleanly hoisted loads and LICM would be improved in general. That was my original approach too, and a good portion of the work on this performance issue was spent trying to improve LICM/AA. What I found is that LICM uses a simple and fast but conservative Type-Based Alias Analysis (AliasSet). I would have to make major changes to that alias analysis to solve my problem, and such changes are unlikely to be accepted given the strong pushback against adding too much complexity to AliasSet. I also thought about making LICM use the more accurate but more expensive MemDep alias analysis, in a similar way to how GVN-based PRE uses it (MemDep gives good alias analysis info for my real benchmark). But given that the LICM pass is called numerous times, this would add substantial compile time. Since I did not have an agreeable fix in LICM/AA, I went to the next best place to put a fix, which is SimplifyIndVar. It is the next best place because 1) it is where the truncate instruction and the inefficient code are first generated, 2) it is an early pass, and I prefer that we clean up code early instead of letting inefficient IR go through other passes, and 3) it makes the IndVarSimplify widening optimization more robust. To see the third point, consider two similar C examples whose only difference is that one has a load inside the loop and the other has a load outside the loop (LICM could not hoist it, either because of a real legality issue or because of a limitation of LICM, as in my case). The widening generates quite different IR for the two cases even though whether the load is inside or outside is irrelevant to the widening itself. In other words, we want it to generate similar code for closely similar input IR.

az added inline comments. Jul 25 2018, 4:57 PM

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1431	Good catch. I updated the patch so that it does not generate two versions of the instruction, an i32 one and an i64 one (for the example below, two versions of the multiply are a possibility). I did so by making sure that the instruction in question is consumed by a sign/zero extend instruction, so that we get the full benefit of widening and eliminate the extend instruction. Having said that, the original code most likely suffers from the same issue of generating multiple versions of an instruction. I will look at it after this patch. As for the general idea of adding a cost model for the widening optimization, it is worth investigating (maybe along with the other optimizations within SimplifyIndVar, in case they do not have a cost model). If you are aware of such examples, please share them, but I think you have a good point: we are transforming the code by widening and eliminating some instructions without checking profitability.

az marked an inline comment as done. Jul 25 2018, 4:58 PM

Some cleaning and update to comments.

az added a reviewer: sebpop. Aug 23 2018, 2:09 PM

sebpop added inline comments. Sep 4 2018, 11:43 AM

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1393	clang-format
1415	Variable names should start with a Capital letter.
1423	Rewrite like so: if (!isa<SCEVSignExtendExpr>(ExtendOperExpr)) return false;
1429	idem: use isa<> shorter format.
1464	Can you please remove all the unneeded parentheses? Also I find it more clear if you first check for all cases that exit the loop with `WideningUseful = false; break;`
1468	There is a path on which we would transform the code in this rAUW stmt, and in a later iteration will fail in the `else break` clause. Can we transform this loop such that we split the analysis phase from code generation part? Maybe by using a vector of the things to be replaced. The analysis part that may fail with `return false;` should be moved before `// Generating a widening use instruction.` You can also split all the code gen part in a separate function that does not fail, and call it from below...
1590	... and call it from here. if (analysisFails()) return nullptr; codeGenWiden();

sebpop added inline comments. Sep 4 2018, 11:48 AM

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1590	this should be: if (analysisSucceeds()) { codeGenWiden(); return nullptr; }
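The split requested in these comments, an analysis phase that can fail followed by a code generation phase that cannot, can be sketched on a toy data structure. Everything below (`ToySlot`, `analyze`, `codeGen`, `tryTransform`, `valuesAfter`) is hypothetical and only illustrates the control flow. It is not the code from the patch and does not use any LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a piece of IR: a value plus a flag saying
// whether the transform is allowed to rewrite it.
struct ToySlot {
  int Value;
  bool Replaceable;
};

// Analysis phase: inspects every slot and never mutates anything.
// Fills Worklist with the indices to rewrite; returns false if any slot
// blocks the transform.
bool analyze(const std::vector<ToySlot> &Slots,
             std::vector<std::size_t> &Worklist) {
  Worklist.clear();
  for (std::size_t I = 0; I < Slots.size(); ++I) {
    if (!Slots[I].Replaceable)
      return false;
    Worklist.push_back(I);
  }
  return true;
}

// Code generation phase: cannot fail, only applies what analysis approved.
void codeGen(std::vector<ToySlot> &Slots,
             const std::vector<std::size_t> &Worklist, int NewValue) {
  for (std::size_t I : Worklist)
    Slots[I].Value = NewValue;
}

// Driver: transform only if the whole analysis succeeds.
bool tryTransform(std::vector<ToySlot> &Slots, int NewValue) {
  std::vector<std::size_t> Worklist;
  if (!analyze(Slots, Worklist))
    return false;
  codeGen(Slots, Worklist, NewValue);
  return true;
}

// Convenience for experiments: runs tryTransform on a copy and returns
// the resulting values.
std::vector<int> valuesAfter(std::vector<ToySlot> Slots, int NewValue) {
  tryTransform(Slots, NewValue);
  std::vector<int> Out;
  for (const ToySlot &S : Slots)
    Out.push_back(S.Value);
  return Out;
}
```

With this shape, an input whose second slot is not replaceable leaves the first slot untouched, which is the partial-rewrite hazard the comment on line 1468 describes.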

junlim added a subscriber: junlim. Sep 5 2018, 8:28 AM

az updated this revision to Diff 164243. Sep 6 2018, 10:45 AM

az marked 5 inline comments as done. Sep 6 2018, 10:50 AM

az added inline comments.

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1464	Removed one. I tried to remove another one that should work fine based on C precedence rules, but I got a compiler warning. So it is slightly better but still parentheses-heavy.
1590	Done what you have in mind, which is to completely separate analysis from code gen, but I still kept them in the same function with clearly marked analysis and codegen phases, mainly because they share code that may need to be re-executed if separated. I can separate them into two functions if you still think that this small enhancement to widening needs an analysis function and a code gen function.

sebpop added inline comments. Sep 6 2018, 12:42 PM

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp

1440

Let's break this down into smaller statements that can be read with ease:
here is my suggestion

if (ExtKind == SignExtended) {
  for () {
    auto *User = ...;
    if (isa<SExtInst>(User) && User->getType() == WideType)
      continue;
    return false;
  }
} else { // ExtKind == ZeroExtended
  for () {
    auto *User = ...;
    if (isa<ZExtInst>(User) && User->getType() == WideType)
      continue;
    return false;
  }
}

1590

Let's split the function into two smaller ones.

az updated this revision to Diff 164285. Sep 6 2018, 2:06 PM

az marked 2 inline comments as done.

The patch looks good to me.
Please address the last two comments and then apply.

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp
1439	Let's simplify the loop like so: for () { ZExtInst *User = dyn_cast<ZExtInst>(U.getUser()); if (!User || User->getType() != WideType) return false; } and the same for the signExtend loop.
1506	please remove return stmt.
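The profitability check discussed in these comments (every user of the narrow instruction must be an extend of the matching kind that produces the wide type) can be modeled without LLVM. The sketch below uses hypothetical toy types, `ToyOpcode` and `ToyUser`, in place of the real IR classes, and a bit width in place of a type. It is an illustration of the rule only, not the code that was committed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class ExtendKind { ZeroExtended, SignExtended };
enum class ToyOpcode { SExt, ZExt, Other };

// Hypothetical stand-in for an IR user of the narrow instruction.
struct ToyUser {
  ToyOpcode Opcode;
  unsigned BitWidth; // width of the user's result type
};

// Widening is treated as profitable only if every user is an extend of the
// matching kind that produces the wide type, so that all of them can be
// replaced by the widened instruction. An empty user list passes vacuously.
bool allUsersAreMatchingExtends(const std::vector<ToyUser> &Users,
                                ExtendKind ExtKind, unsigned WideBits) {
  const ToyOpcode Wanted =
      ExtKind == ExtendKind::SignExtended ? ToyOpcode::SExt : ToyOpcode::ZExt;
  return std::all_of(Users.begin(), Users.end(), [&](const ToyUser &U) {
    return U.Opcode == Wanted && U.BitWidth == WideBits;
  });
}
```

A single non-extend user, an extend of the wrong kind, or an extend to a different width is enough to reject the transform, which avoids ending up with both a narrow and a wide copy of the same multiply.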

This revision is now accepted and ready to land. Sep 6 2018, 4:18 PM

az updated this revision to Diff 164504. Sep 7 2018, 1:43 PM

az marked 2 inline comments as done.

Closed by commit rL341726: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Laod… (authored by az). Sep 7 2018, 3:43 PM

This revision was automatically updated to reflect the committed changes.

zhongduo mentioned this in D73059: [IndVarSimplify] Extend previous special case for load use instruction to any narrow type loop variant to avoid extra trunc instruction. Jan 20 2020, 11:11 AM

dancgr mentioned this in rGeae228a292f5: [IndVarSimplify] Extend previous special case for load use instruction to any…. Mar 5 2020, 1:45 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp	131 lines
llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll	84 lines

Diff 164243

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp

(1,011 lines not shown; context: protected:)
WidenedRecTy getExtendedOperandRecurrence(NarrowIVDefUse DU);

const SCEV *getSCEVByOpCode(const SCEV *LHS, const SCEV *RHS,
unsigned OpCode) const;

Instruction *widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter);

bool widenLoopCompare(NarrowIVDefUse DU);
		bool widenWithVariantLoadUse(NarrowIVDefUse DU);

void pushNarrowIVUsers(Instruction *NarrowDef, Instruction *WideDef);
};

} // end anonymous namespace

/// Perform a quick domtree based check for loop invariance assuming that V is
/// used within the loop. LoopInfo::isLoopInvariant() seems gratuitous for this
(328 lines not shown; context: bool WidenIV::widenLoopCompare(NarrowIVDefUse DU) {)
// Widen the other operand of the compare, if necessary.
if (CastWidth < IVWidth) {
Value *ExtOp = createExtendInst(Op, WideType, Cmp->isSigned(), Cmp);
DU.NarrowUse->replaceUsesOfWith(Op, ExtOp);
}
return true;
}

		/// If the narrow use is an instruction whose two operands are the defining
		/// instruction of DU and a load instruction, then we have the following:
		/// if the load is hoisted outside the loop, then we do not reach this function
		/// as scalar evolution analysis works fine in widenIVUse with variables
		/// hoisted outside the loop and efficient code is subsequently generated by
		/// not emitting truncate instructions. But when the load is not hoisted
		/// (whether due to limitation in alias analysis or due to a true legality),
		/// then scalar evolution can not proceed with loop variant values and
		/// inefficient code is generated. This function handles the non-hoisted load
		/// special case by making the optimization generate the same type of code for
		/// hoisted and non-hoisted load (widen use and eliminate sign extend
		/// instruction). This special case is important especially when the induction
		/// variables are affecting addressing mode in code generation.
		bool WidenIV::widenWithVariantLoadUse(NarrowIVDefUse DU) {
		// 1. Analysis Phase - return false if it not legal or profitable to widen
		Instruction *NarrowUse = DU.NarrowUse;
		Instruction *NarrowDef = DU.NarrowDef;
		Instruction *WideDef = DU.WideDef;

		// Handle the common case of add<nsw/nuw>
		const unsigned OpCode = NarrowUse->getOpcode();
		// Only Add/Sub/Mul instructions are supported.
		if (OpCode != Instruction::Add && OpCode != Instruction::Sub &&
		OpCode != Instruction::Mul)
		return false;

		// The operand that is not defined by NarrowDef of DU. Let's call it the
		// other operand.
		unsigned ExtendOperIdx = DU.NarrowUse->getOperand(0) == NarrowDef ? 1 : 0;
		assert(DU.NarrowUse->getOperand(1 - ExtendOperIdx) == DU.NarrowDef &&
		"bad DU");

		const SCEV *ExtendOperExpr = nullptr;
		const OverflowingBinaryOperator *OBO =
		cast<OverflowingBinaryOperator>(NarrowUse);
		ExtendKind ExtKind = getExtendKind(NarrowDef);
		if (ExtKind == SignExtended && OBO->hasNoSignedWrap())
		ExtendOperExpr = SE->getSignExtendExpr(
		SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);
		else if (ExtKind == ZeroExtended && OBO->hasNoUnsignedWrap())
		ExtendOperExpr = SE->getZeroExtendExpr(
		SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);
		else
		return false;

		// We are interested in the other operand being a load instruction.
		// But, we should look into relaxing this restriction later on.
		auto *I = dyn_cast<Instruction>(NarrowUse->getOperand(ExtendOperIdx));
		if (I && I->getOpcode() != Instruction::Load)
		return false;

		// Verifying that Defining operand is an AddRec
		const SCEV *Op1 = SE->getSCEV(WideDef);
		const SCEVAddRecExpr *AddRecOp1 = dyn_cast<SCEVAddRecExpr>(Op1);
		if (!AddRecOp1 || AddRecOp1->getLoop() != L)
		return false;
		// Verifying that other operand is an Extend.
		if (ExtKind == SignExtended) {
		//if (!dyn_cast<SCEVSignExtendExpr>(ExtendOperExpr))
		if (!isa<SCEVSignExtendExpr>(ExtendOperExpr))
		return false;
		} else {
		//if (!dyn_cast<SCEVZeroExtendExpr>(ExtendOperExpr))
		if (!isa<SCEVZeroExtendExpr>(ExtendOperExpr))
		return false;
		}

		// Profitability: Check if widening the use eliminates sign/zero extend
		// instructions. In other words, check that widening helps all use and
		// not just this DU.
		for (Use &U : NarrowUse->uses()) {
		auto *User = cast<Instruction>(U.getUser());
		if (!(((isa<SExtInst>(User) && ExtKind == SignExtended) ||
		(isa<ZExtInst>(User) && ExtKind == ZeroExtended)) &&
		User->getType() == WideType)) {
		return false;
		}
		}

		// 2. Code Gen Phase
		LLVM_DEBUG(dbgs() << "Cloning arithmetic IVUser: " << *NarrowUse << "\n");

		// Generating a widening use instruction.
		Value *LHS = (NarrowUse->getOperand(0) == NarrowDef)
		? WideDef
		: createExtendInst(NarrowUse->getOperand(0), WideType,
		ExtKind, NarrowUse);
		Value *RHS = (NarrowUse->getOperand(1) == NarrowDef)
		? WideDef
		: createExtendInst(NarrowUse->getOperand(1), WideType,
		ExtKind, NarrowUse);

		auto *NarrowBO = cast<BinaryOperator>(NarrowUse);
		auto *WideBO = BinaryOperator::Create(NarrowBO->getOpcode(), LHS, RHS,
		NarrowBO->getName());
		IRBuilder<> Builder(NarrowUse);
		Builder.Insert(WideBO);
		WideBO->copyIRFlags(NarrowBO);

		if (ExtKind == SignExtended)
		ExtendKindMap[NarrowUse] = SignExtended;
		else
		ExtendKindMap[NarrowUse] = ZeroExtended;

		// Check if widening the use eliminates sign/zero extend instructions.
		// Update the Use.
		for (Use &U : NarrowUse->uses()) {
		auto *User = cast<Instruction>(U.getUser());
		if (((isa<SExtInst>(User) && getExtendKind(NarrowUse) == SignExtended) ||
		(isa<ZExtInst>(User) && getExtendKind(NarrowUse) == ZeroExtended)) &&
		User->getType() == WideType) {
		LLVM_DEBUG(dbgs() << "INDVARS: eliminating " << *User << " replaced by "
		<< *WideBO << "\n");
		++NumElimExt;
		User->replaceAllUsesWith(WideBO);
		DeadInsts.emplace_back(User);
		}
		}

		return true;
		}

/// Determine whether an individual user of the narrow IV can be widened. If so,		/// Determine whether an individual user of the narrow IV can be widened. If so,
/// return the wide clone of the user.		/// return the wide clone of the user.
Instruction *WidenIV::widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter) {		Instruction *WidenIV::widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter) {
assert(ExtendKindMap.count(DU.NarrowDef) &&		assert(ExtendKindMap.count(DU.NarrowDef) &&
"Should already know the kind of extension used to widen NarrowDef");		"Should already know the kind of extension used to widen NarrowDef");

// Stop traversing the def-use chain at inner-loop phis or post-loop phis.		// Stop traversing the def-use chain at inner-loop phis or post-loop phis.
if (PHINode *UsePhi = dyn_cast<PHINode>(DU.NarrowUse)) {		if (PHINode *UsePhi = dyn_cast<PHINode>(DU.NarrowUse)) {
if (LI->getLoopFor(UsePhi->getParent()) != L) {		if (LI->getLoopFor(UsePhi->getParent()) != L) {
// For LCSSA phis, sink the truncate outside the loop.		// For LCSSA phis, sink the truncate outside the loop.
// After SimplifyCFG most loop exit targets have a single predecessor.		// After SimplifyCFG most loop exit targets have a single predecessor.
// Otherwise fall back to a truncate within the loop.		// Otherwise fall back to a truncate within the loop.
if (UsePhi->getNumOperands() != 1)		if (UsePhi->getNumOperands() != 1)
truncateIVUse(DU, DT, LI);		truncateIVUse(DU, DT, LI);
else {		else {
// Widening the PHI requires us to insert a trunc. The logical place		// Widening the PHI requires us to insert a trunc. The logical place
// for this trunc is in the same BB as the PHI. This is not possible if		// for this trunc is in the same BB as the PHI. This is not possible if
// the BB is terminated by a catchswitch.		// the BB is terminated by a catchswitch.
if (isa<CatchSwitchInst>(UsePhi->getParent()->getTerminator()))		if (isa<CatchSwitchInst>(UsePhi->getParent()->getTerminator()))
return nullptr;		return nullptr;
		sebpopUnsubmitted Done Reply Inline Actions please remove return stmt. sebpop: please remove return stmt.

PHINode *WidePhi =		PHINode *WidePhi =
PHINode::Create(DU.WideDef->getType(), 1, UsePhi->getName() + ".wide",		PHINode::Create(DU.WideDef->getType(), 1, UsePhi->getName() + ".wide",
UsePhi);		UsePhi);
WidePhi->addIncoming(DU.WideDef, UsePhi->getIncomingBlock(0));		WidePhi->addIncoming(DU.WideDef, UsePhi->getIncomingBlock(0));
IRBuilder<> Builder(&*WidePhi->getParent()->getFirstInsertionPt());		IRBuilder<> Builder(&*WidePhi->getParent()->getFirstInsertionPt());
Value *Trunc = Builder.CreateTrunc(WidePhi, DU.NarrowDef->getType());		Value *Trunc = Builder.CreateTrunc(WidePhi, DU.NarrowDef->getType());
UsePhi->replaceAllUsesWith(Trunc);		UsePhi->replaceAllUsesWith(Trunc);
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	Instruction *WidenIV::widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter) {

  assert((WideAddRec.first == nullptr) == (WideAddRec.second == Unknown));
  if (!WideAddRec.first) {
    // If use is a loop condition, try to promote the condition instead of
    // truncating the IV first.
    if (widenLoopCompare(DU))
      return nullptr;

    // We are here about to generate a truncate instruction that may hurt
    // performance because the scalar evolution expression computed earlier
    // in WideAddRec.first does not indicate a polynomial induction expression.
    // In that case, look at the operands of the use instruction to determine
    // if we can still widen the use instead of truncating its operand.
    if (widenWithVariantLoadUse(DU))
      return nullptr;
sebpop: ... and call it from here. if (analysisFails()) return nullptr; codeGenWiden();
sebpop: this should be: if (analysisSucceeds()) { codeGenWiden(); return nullptr; }
az: Done what you have in mind, which is to completely separate analysis from code gen, but I still kept them in the same function, with clearly marked analysis and codegen phases, mainly because they share code that may need to be re-executed if separated. I can separate them into two functions if you still think that this small enhancement to widening needs an analysis function and a code gen function.
sebpop: Let's split the function into two smaller ones.
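
The structure requested in this thread (an analysis function that only inspects, a codegen function that only transforms, and a caller that falls back to truncation when analysis fails) can be sketched outside LLVM. This is a toy model under stated assumptions: `ToyUse`, `canWidenToyUse`, `widenToyUse`, and `tryWidenToyUse` are hypothetical names, and the legality conditions are illustrative guesses loosely based on the test comments in this revision, not the checks the patch actually performs.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a narrow use of the induction variable.
// This is NOT an LLVM type; every field is an assumption for illustration.
struct ToyUse {
  std::string Opcode;      // e.g. "mul"
  bool HasNoSignedWrap;    // models the nsw flag on the narrow instruction
  bool OtherOperandIsLoad; // models the loop-variant (non-hoisted) load operand
  int NumNonExtUsers;      // users of the narrow result other than the sext
  bool Widened;            // set by the codegen phase
};

// Analysis phase: inspects the use and never mutates it.
// The conditions are illustrative assumptions, not the patch's real checks.
bool canWidenToyUse(const ToyUse &U) {
  if (U.Opcode != "mul")
    return false;
  if (!U.HasNoSignedWrap)
    return false;
  if (!U.OtherOperandIsLoad)
    return false;
  // A second narrow user would leave both a narrow and a wide multiply alive.
  return U.NumNonExtUsers == 0;
}

// Codegen phase: performs the rewrite; callers must run the analysis first.
void widenToyUse(ToyUse &U) {
  assert(canWidenToyUse(U) && "codegen called without a successful analysis");
  U.Widened = true;
}

// Caller shape suggested in review: succeed-then-transform, else fall back.
// Returns true if the use was widened, false if the caller should truncate.
bool tryWidenToyUse(ToyUse &U) {
  if (!canWidenToyUse(U))
    return false;
  widenToyUse(U);
  return true;
}
```

Keeping the analysis side-effect free means a failed check leaves the use untouched, so the existing `truncateIVUse` fallback path stays valid.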

    // This user does not evaluate to a recurrence after widening, so don't
    // follow it. Instead insert a Trunc to kill off the original use,
    // eventually isolating the original narrow IV so it can be removed.
    truncateIVUse(DU, DT, LI);
    return nullptr;
  }
  // Assume block terminators cannot evaluate to a recurrence. We can't
  // insert a Trunc after a terminator if there happens to be a critical edge.

llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll


for.cond.for.end_crit_edge: ; preds = %for.inc
  br label %for.end

for.end: ; preds = %for.cond.for.end_crit_edge, %entry
  %call = call i32 @dummy(i32* getelementptr inbounds ([100 x i32], [100 x i32]* @a, i32 0, i32 0), i32* getelementptr inbounds ([100 x i32], [100 x i32]* @b, i32 0, i32 0))
  ret i32 0
}

				%struct.image = type {i32, i32}
				define i32 @foo4(%struct.image* %input, i32 %length, i32* %in) {
				entry:
				%stride = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 1
				%0 = load i32, i32* %stride, align 4
				%cmp17 = icmp sgt i32 %length, 1
				br i1 %cmp17, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%channel = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 0
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.body
				%1 = phi i32 [ %6, %for.body ]
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				%2 = phi i32 [ 0, %entry ], [ %1, %for.cond.cleanup.loopexit ]
				ret i32 %2

				; The mul instruction below is widened instead of generating a truncate instruction for it,
				; regardless of whether the Load operand of mul is inside or outside the loop (we have both cases).
				; CHECK: for.body:
				; CHECK-NOT: trunc
				for.body: ; preds = %for.body.lr.ph, %for.body
				%x.018 = phi i32 [ 1, %for.body.lr.ph ], [ %add, %for.body ]
				%add = add nuw nsw i32 %x.018, 1
				%3 = load i32, i32* %channel, align 8
				%mul = mul nsw i32 %3, %add
				%idx.ext = sext i32 %mul to i64
				%add.ptr = getelementptr inbounds i32, i32* %in, i64 %idx.ext
				%4 = load i32, i32* %add.ptr, align 4
				%mul1 = mul nsw i32 %0, %add
				%idx.ext1 = sext i32 %mul1 to i64
efriedma: Please fix the testcase so this load isn't dead (to make the testcase less fragile).
				%add.ptr1 = getelementptr inbounds i32, i32* %in, i64 %idx.ext1
				%5 = load i32, i32* %add.ptr1, align 4
				%6 = add i32 %4, %5
				%cmp = icmp slt i32 %add, %length
				br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
				}


				define i32 @foo5(%struct.image* %input, i32 %length, i32* %in) {
				entry:
				%stride = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 1
				%0 = load i32, i32* %stride, align 4
				%cmp17 = icmp sgt i32 %length, 1
				br i1 %cmp17, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				%channel = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 0
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.body
				%1 = phi i32 [ %7, %for.body ]
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				%2 = phi i32 [ 0, %entry ], [ %1, %for.cond.cleanup.loopexit ]
				ret i32 %2

				; This example is the same as above except that the first mul is used in two places
				; and this may result in having two versions of the multiply: an i32 and i64 version.
				; In this case, keep the truncate instructions to avoid this redundancy.
				; CHECK: for.body:
				; CHECK: trunc
				for.body: ; preds = %for.body.lr.ph, %for.body
				%x.018 = phi i32 [ 1, %for.body.lr.ph ], [ %add, %for.body ]
				%add = add nuw nsw i32 %x.018, 1
				%3 = load i32, i32* %channel, align 8
				%mul = mul nsw i32 %3, %add
				%idx.ext = sext i32 %mul to i64
				%add.ptr = getelementptr inbounds i32, i32* %in, i64 %idx.ext
				%4 = load i32, i32* %add.ptr, align 4
				%mul1 = mul nsw i32 %0, %add
				%idx.ext1 = sext i32 %mul1 to i64
				%add.ptr1 = getelementptr inbounds i32, i32* %in, i64 %idx.ext1
				%5 = load i32, i32* %add.ptr1, align 4
				%6 = add i32 %4, %5
				%7 = add i32 %6, %mul
				%cmp = icmp slt i32 %add, %length
				br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
				}
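
The two test functions rely on the fact that, when the narrow multiply cannot overflow (the `nsw` flag), sign-extending the 32-bit product gives the same value as multiplying the sign-extended operands in 64 bits. A minimal C++ sketch of that arithmetic follows; `indexNarrowThenExtend` and `indexWide` are hypothetical helper names, and the code models the IR pattern rather than reproducing what the pass emits.

```cpp
#include <cassert>
#include <cstdint>

// Narrow form: 32-bit multiply followed by a sign-extend, mirroring
// `mul nsw i32` + `sext i32 ... to i64`. Signed overflow is undefined in
// C++, which plays the role of the nsw assumption here.
int64_t indexNarrowThenExtend(int32_t loaded, int32_t iv) {
  int32_t mul = loaded * iv;
  return static_cast<int64_t>(mul);
}

// Wide form: the induction variable is already 64-bit, the loaded value is
// sign-extended once, and the multiply happens in 64 bits. No truncate of
// the wide induction variable is needed.
int64_t indexWide(int32_t loaded, int64_t wideIv) {
  return static_cast<int64_t>(loaded) * wideIv;
}
```

For inputs whose 32-bit product does not overflow, both helpers return the same index, which is why the `sext` in `@foo4` can be eliminated by widening the `mul`. In `@foo5` the narrow `%mul` also feeds an `i32` add, so widening would leave both an `i32` and an `i64` multiply in the loop; the test expects the `trunc` to be kept instead.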

Diff 164243

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp

llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll
