This is an archive of the discontinued LLVM Phabricator instance.

[IndVarSimplify] Extend previous special case for load use instruction to any narrow type loop variant to avoid extra trunc instruction
ClosedPublic

Authored by zhongduo on Jan 20 2020, 11:11 AM.

Details

Reviewers

sanjoy
efriedma
sebpop
reames
az
javed.absar
amehsan

Commits

rGeae228a292f5: [IndVarSimplify] Extend previous special case for load use instruction to any…

Summary

widenIVUse avoids generating a trunc by evaluating the use as an AddRec. This
does not work when:

1) SCEV traces back to an instruction inside the loop that SCEV cannot
   expand, e.g. add %indvar, (load %addr)
2) SCEV finds a loop variant, e.g. add %indvar, %loopvariant

Although SCEV fails to avoid the trunc in these cases, we can still use an
instruction-combining approach to prove that the trunc is not required. This
can be extended with other instruction-combining checks later, but for now we
handle the following case (sub can also be "add" or "mul", and "nsw + sext"
can also be "nuw + zext"):

Src:
  %c = sub nsw %b, %indvar
  %d = sext %c to i64
Dst:
  %indvar.ext1 = sext %indvar to i64
  %m = sext %b to i64
  %d = sub nsw i64 %m, %indvar.ext1

Therefore, as long as the result of the add/sub/mul is extended to the wide
type with the matching combination of extension and no-wrap flag (sext with
nsw, zext with nuw), no trunc is required regardless of how %b is generated.
This pattern is common when calculating addresses on 64-bit architectures.
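
The Src/Dst equivalence above can be checked by brute force on a smaller type pair. The following is only an illustrative sketch and is not part of the patch: it uses 8-bit and 16-bit integers as stand-ins for i32 and i64, and the helper names (CheckResult, checkSignedSub, checkUnsignedSub, wrappingSubDiffers, runChecks) are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Narrow type is 8 bits and wide type is 16 bits, standing in for i32 and i64
// so that every operand pair can be enumerated.
struct CheckResult {
  long checked = 0;     // operand pairs for which the no-wrap flag would hold
  long mismatches = 0;  // pairs where Src and Dst disagree
};

// "sub nsw" followed by sext, compared with a wide sub of sext'ed operands.
CheckResult checkSignedSub() {
  CheckResult r;
  for (int b = INT8_MIN; b <= INT8_MAX; ++b) {
    for (int iv = INT8_MIN; iv <= INT8_MAX; ++iv) {
      const int exact = b - iv;
      if (exact < INT8_MIN || exact > INT8_MAX)
        continue;  // the narrow sub would wrap, so nsw could not be set
      const int8_t narrow = static_cast<int8_t>(exact);   // %c = sub nsw
      const int16_t src = static_cast<int16_t>(narrow);   // %d = sext %c
      const int16_t dst = static_cast<int16_t>(
          static_cast<int16_t>(b) - static_cast<int16_t>(iv));  // wide sub
      ++r.checked;
      if (src != dst)
        ++r.mismatches;
    }
  }
  return r;
}

// "sub nuw" followed by zext, compared with a wide sub of zext'ed operands.
CheckResult checkUnsignedSub() {
  CheckResult r;
  for (unsigned b = 0; b <= UINT8_MAX; ++b) {
    for (unsigned iv = 0; iv <= UINT8_MAX; ++iv) {
      if (b < iv)
        continue;  // the narrow sub would wrap, so nuw could not be set
      const uint8_t narrow = static_cast<uint8_t>(b - iv);  // %c = sub nuw
      const uint16_t src = narrow;                          // %d = zext %c
      const uint16_t dst = static_cast<uint16_t>(
          static_cast<uint16_t>(b) - static_cast<uint16_t>(iv));  // wide sub
      ++r.checked;
      if (src != dst)
        ++r.mismatches;
    }
  }
  return r;
}

// Without the no-wrap flag the rewrite is not valid: 0 - 1 wraps to 255 in
// 8 bits, while the 16-bit sub of the zero-extended operands wraps to 65535.
bool wrappingSubDiffers() {
  const uint8_t b = 0, iv = 1;
  const uint16_t src = static_cast<uint8_t>(b - iv);
  const uint16_t dst = static_cast<uint16_t>(
      static_cast<uint16_t>(b) - static_cast<uint16_t>(iv));
  return src != dst;
}

// Convenience entry point; call it from main() or a test harness.
void runChecks() {
  assert(checkSignedSub().mismatches == 0);
  assert(checkUnsignedSub().mismatches == 0);
  assert(wrappingSubDiffers());
}
```

The sketch only covers sub; the same enumeration could be repeated for add and mul. It says nothing about how %b is produced, which is the point of the argument: the equivalence depends only on the no-wrap flag matching the extension kind.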

Note that this patch reuses almost all of the code from D49151 by @az:
https://reviews.llvm.org/D49151

It extends that patch by providing a proof of why the trunc is unnecessary in the more general case.
It should also resolve some of the concerns from the following discussion with @reames:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180910/585945.html

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

zhongduo created this revision. Jan 20 2020, 11:11 AM

Herald added a project: Restricted Project. Jan 20 2020, 11:11 AM

Herald added subscribers: llvm-commits, hiraditya.

@sanjoy @efriedma @sebpop @reames @az @javed.absar Could you please review this patch and provide some feedback? I would highly appreciate your help.

zhongduo added a reviewer: amehsan. Feb 4 2020, 5:43 PM

@sanjoy @efriedma @sebpop @reames @az @javed.absar

Is there any concern about this patch? We have reviewed this internally and it looks good to me. It has also been tested extensively. The condition on the load instruction does not seem to play any role in the correctness of the transformation and appears to be completely arbitrary. I accept the patch for now, but I would prefer to hear from one of the other reviewers before committing it.

This revision is now accepted and ready to land. Mar 4 2020, 9:14 PM

The patch looks good to me. It is a generalization of the original patch. The only thing missing here is to describe the testing results or post them here (even though on paper, getting rid of a truncate instruction should be beneficial).

Some background: When I wrote the original patch, I had two versions: the one with the load restriction and the general one, which is exactly what you have here. We tested both of them on Geekbench4 and SPEC 2000 on ARM (Exynos and A57). Both versions of the patch got around 10% improvement for one Geekbench kernel, with no change for the rest. That improvement was the reason for the patch. I chose to post the patch with the load restriction and limit its scope because we did not have the resources to test more, in particular on non-ARM platforms.

@az This problem was exposed in a small test case that we were working on. For that particular one, we see around 8%, but it might be the result of preventing other optimizations from happening. I didn't see a significant impact with other, larger test cases.

@az

Our pipeline has some differences from the default pass pipeline. The problem was exposed when playing with the pipeline. As Jimmy mentioned, it is in one of the smaller benchmarks in the test suite. The overall impact is small, but on the other hand, we are not adding any compile time or any other kind of cost, so I don't see an issue. The remaining question is functional stability. The code looks correct to me. You have also looked into it in the past, so I think on that issue we are fine. There is always a chance that we expose some other bug somewhere else. We have not observed that, and given the limited impact of the patch it is not very likely. Anyway, I think overall this should be fine to merge. Still, I will wait a little longer to see if any of the reviewers raises a concern in the next couple of days.

LGTM.

Since you discovered this problem with a non-default pass pipeline, I do not need to see your test. With the default LLVM passes, it was easy for me to find tests with loads where the patch is beneficial, because limitations in alias analysis prevent some optimizations and this patch can clean things up. But back then I was not able to find a meaningful test without a load where this patch helps. It is great that you have a test (with your non-default passes) where this patch is useful. It seems that both of us tested this code very well on ARM. Also, in theory, removing an extra instruction should be beneficial for other architectures too.

az accepted this revision. Mar 5 2020, 10:47 AM

Closed by commit rGeae228a292f5: [IndVarSimplify] Extend previous special case for load use instruction to any… (authored by zhongduo, committed by dancgr). Mar 5 2020, 1:45 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                        Size
llvm/lib/Transforms/Scalar/IndVarSimplify.cpp               55 lines
llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll    49 lines

Diff 248600

llvm/lib/Transforms/Scalar/IndVarSimplify.cpp

Show First 20 Lines • Show All 732 Lines • ▼ Show 20 Lines	protected:
WidenedRecTy getExtendedOperandRecurrence(NarrowIVDefUse DU);		WidenedRecTy getExtendedOperandRecurrence(NarrowIVDefUse DU);

const SCEV *getSCEVByOpCode(const SCEV *LHS, const SCEV *RHS,		const SCEV *getSCEVByOpCode(const SCEV *LHS, const SCEV *RHS,
unsigned OpCode) const;		unsigned OpCode) const;

Instruction *widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter);		Instruction *widenIVUse(NarrowIVDefUse DU, SCEVExpander &Rewriter);

bool widenLoopCompare(NarrowIVDefUse DU);		bool widenLoopCompare(NarrowIVDefUse DU);
bool widenWithVariantLoadUse(NarrowIVDefUse DU);		bool widenWithVariantUse(NarrowIVDefUse DU);
void widenWithVariantLoadUseCodegen(NarrowIVDefUse DU);		void widenWithVariantUseCodegen(NarrowIVDefUse DU);

void pushNarrowIVUsers(Instruction *NarrowDef, Instruction *WideDef);		void pushNarrowIVUsers(Instruction *NarrowDef, Instruction *WideDef);
};		};

} // end anonymous namespace		} // end anonymous namespace

Value *WidenIV::createExtendInst(Value *NarrowOper, Type *WideType,		Value *WidenIV::createExtendInst(Value *NarrowOper, Type *WideType,
bool IsSigned, Instruction *Use) {		bool IsSigned, Instruction *Use) {
▲ Show 20 Lines • Show All 321 Lines • ▼ Show 20 Lines	bool WidenIV::widenLoopCompare(NarrowIVDefUse DU) {
// Widen the other operand of the compare, if necessary.		// Widen the other operand of the compare, if necessary.
if (CastWidth < IVWidth) {		if (CastWidth < IVWidth) {
Value *ExtOp = createExtendInst(Op, WideType, Cmp->isSigned(), Cmp);		Value *ExtOp = createExtendInst(Op, WideType, Cmp->isSigned(), Cmp);
DU.NarrowUse->replaceUsesOfWith(Op, ExtOp);		DU.NarrowUse->replaceUsesOfWith(Op, ExtOp);
}		}
return true;		return true;
}		}

/// If the narrow use is an instruction whose two operands are the defining		// The widenIVUse avoids generating trunc by evaluating the use as AddRec, this
/// instruction of DU and a load instruction, then we have the following:		// will not work when:
/// if the load is hoisted outside the loop, then we do not reach this function		// 1) SCEV traces back to an instruction inside the loop that SCEV can not
/// as scalar evolution analysis works fine in widenIVUse with variables		// expand, eg. add %indvar, (load %addr)
/// hoisted outside the loop and efficient code is subsequently generated by		// 2) SCEV finds a loop variant, eg. add %indvar, %loopvariant
/// not emitting truncate instructions. But when the load is not hoisted		// While SCEV fails to avoid trunc, we can still try to use instruction
/// (whether due to limitation in alias analysis or due to a true legality),		// combining approach to prove trunc is not required. This can be further
/// then scalar evolution can not proceed with loop variant values and		// extended with other instruction combining checks, but for now we handle the
/// inefficient code is generated. This function handles the non-hoisted load		// following case (sub can be "add" and "mul", "nsw + sext" can be "nus + zext")
/// special case by making the optimization generate the same type of code for		//
/// hoisted and non-hoisted load (widen use and eliminate sign extend		// Src:
/// instruction). This special case is important especially when the induction		// %c = sub nsw %b, %indvar
/// variables are affecting addressing mode in code generation.		// %d = sext %c to i64
bool WidenIV::widenWithVariantLoadUse(NarrowIVDefUse DU) {		// Dst:
		// %indvar.ext1 = sext %indvar to i64
		// %m = sext %b to i64
		// %d = sub nsw i64 %m, %indvar.ext1
		// Therefore, as long as the result of add/sub/mul is extended to wide type, no
		// trunc is required regardless of how %b is generated. This pattern is common
		// when calculating address in 64 bit architecture
		bool WidenIV::widenWithVariantUse(NarrowIVDefUse DU) {
Instruction *NarrowUse = DU.NarrowUse;		Instruction *NarrowUse = DU.NarrowUse;
Instruction *NarrowDef = DU.NarrowDef;		Instruction *NarrowDef = DU.NarrowDef;
Instruction *WideDef = DU.WideDef;		Instruction *WideDef = DU.WideDef;

// Handle the common case of add<nsw/nuw>		// Handle the common case of add<nsw/nuw>
const unsigned OpCode = NarrowUse->getOpcode();		const unsigned OpCode = NarrowUse->getOpcode();
// Only Add/Sub/Mul instructions are supported.		// Only Add/Sub/Mul instructions are supported.
if (OpCode != Instruction::Add && OpCode != Instruction::Sub &&		if (OpCode != Instruction::Add && OpCode != Instruction::Sub &&
Show All 14 Lines	if (ExtKind == SignExtended && OBO->hasNoSignedWrap())
ExtendOperExpr = SE->getSignExtendExpr(		ExtendOperExpr = SE->getSignExtendExpr(
SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);		SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);
else if (ExtKind == ZeroExtended && OBO->hasNoUnsignedWrap())		else if (ExtKind == ZeroExtended && OBO->hasNoUnsignedWrap())
ExtendOperExpr = SE->getZeroExtendExpr(		ExtendOperExpr = SE->getZeroExtendExpr(
SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);		SE->getSCEV(NarrowUse->getOperand(ExtendOperIdx)), WideType);
else		else
return false;		return false;

// We are interested in the other operand being a load instruction.
// But, we should look into relaxing this restriction later on.
auto *I = dyn_cast<Instruction>(NarrowUse->getOperand(ExtendOperIdx));
if (I && I->getOpcode() != Instruction::Load)
return false;

// Verifying that Defining operand is an AddRec		// Verifying that Defining operand is an AddRec
const SCEV *Op1 = SE->getSCEV(WideDef);		const SCEV *Op1 = SE->getSCEV(WideDef);
const SCEVAddRecExpr *AddRecOp1 = dyn_cast<SCEVAddRecExpr>(Op1);		const SCEVAddRecExpr *AddRecOp1 = dyn_cast<SCEVAddRecExpr>(Op1);
if (!AddRecOp1 || AddRecOp1->getLoop() != L)		if (!AddRecOp1 || AddRecOp1->getLoop() != L)
return false;		return false;
// Verifying that other operand is an Extend.		// Verifying that other operand is an Extend.
if (ExtKind == SignExtended) {		if (ExtKind == SignExtended) {
if (!isa<SCEVSignExtendExpr>(ExtendOperExpr))		if (!isa<SCEVSignExtendExpr>(ExtendOperExpr))
Show All 15 Lines	for (Use &U : NarrowUse->uses()) {
if (!User || User->getType() != WideType)		if (!User || User->getType() != WideType)
return false;		return false;
}		}
}		}

return true;		return true;
}		}

/// Special Case for widening with variant Loads (see		/// Special Case for widening with loop variant (see
/// WidenIV::widenWithVariantLoadUse). This is the code generation part.		/// WidenIV::widenWithVariant). This is the code generation part.
void WidenIV::widenWithVariantLoadUseCodegen(NarrowIVDefUse DU) {		void WidenIV::widenWithVariantUseCodegen(NarrowIVDefUse DU) {
Instruction *NarrowUse = DU.NarrowUse;		Instruction *NarrowUse = DU.NarrowUse;
Instruction *NarrowDef = DU.NarrowDef;		Instruction *NarrowDef = DU.NarrowDef;
Instruction *WideDef = DU.WideDef;		Instruction *WideDef = DU.WideDef;

ExtendKind ExtKind = getExtendKind(NarrowDef);		ExtendKind ExtKind = getExtendKind(NarrowDef);

LLVM_DEBUG(dbgs() << "Cloning arithmetic IVUser: " << *NarrowUse << "\n");		LLVM_DEBUG(dbgs() << "Cloning arithmetic IVUser: " << *NarrowUse << "\n");

▲ Show 20 Lines • Show All 131 Lines • ▼ Show 20 Lines	if (!WideAddRec.first) {
if (widenLoopCompare(DU))		if (widenLoopCompare(DU))
return nullptr;		return nullptr;

// We are here about to generate a truncate instruction that may hurt		// We are here about to generate a truncate instruction that may hurt
// performance because the scalar evolution expression computed earlier		// performance because the scalar evolution expression computed earlier
// in WideAddRec.first does not indicate a polynomial induction expression.		// in WideAddRec.first does not indicate a polynomial induction expression.
// In that case, look at the operands of the use instruction to determine		// In that case, look at the operands of the use instruction to determine
// if we can still widen the use instead of truncating its operand.		// if we can still widen the use instead of truncating its operand.
if (widenWithVariantLoadUse(DU)) {		if (widenWithVariantUse(DU)) {
widenWithVariantLoadUseCodegen(DU);		widenWithVariantUseCodegen(DU);
return nullptr;		return nullptr;
}		}

// This user does not evaluate to a recurrence after widening, so don't		// This user does not evaluate to a recurrence after widening, so don't
// follow it. Instead insert a Trunc to kill off the original use,		// follow it. Instead insert a Trunc to kill off the original use,
// eventually isolating the original narrow IV so it can be removed.		// eventually isolating the original narrow IV so it can be removed.
truncateIVUse(DU, DT, LI);		truncateIVUse(DU, DT, LI);
return nullptr;		return nullptr;
▲ Show 20 Lines • Show All 1,587 Lines • Show Last 20 Lines

llvm/test/Transforms/IndVarSimplify/iv-widen-elim-ext.ll

Show First 20 Lines • Show All 413 Lines • ▼ Show 20 Lines	for.body: ; preds = %for.body.lr.ph, %for.body
%idx.ext1 = sext i32 %mul1 to i64		%idx.ext1 = sext i32 %mul1 to i64
%add.ptr1 = getelementptr inbounds i32, i32* %in, i64 %idx.ext1		%add.ptr1 = getelementptr inbounds i32, i32* %in, i64 %idx.ext1
%5 = load i32, i32* %add.ptr1, align 4		%5 = load i32, i32* %add.ptr1, align 4
%6 = add i32 %4, %5		%6 = add i32 %4, %5
%7 = add i32 %6, %mul		%7 = add i32 %6, %mul
%cmp = icmp slt i32 %add, %length		%cmp = icmp slt i32 %add, %length
br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit		br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
}		}

		define i32 @foo6(%struct.image* %input, i32 %length, i32* %in) {
		entry:
		%stride = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 1
		%0 = load i32, i32* %stride, align 4
		%cmp17 = icmp sgt i32 %length, 1
		br i1 %cmp17, label %for.body.lr.ph, label %for.cond.cleanup

		for.body.lr.ph: ; preds = %entry
		%channel = getelementptr inbounds %struct.image, %struct.image* %input, i64 0, i32 0
		br label %for.body

		for.cond.cleanup.loopexit: ; preds = %for.body
		%1 = phi i32 [ %6, %for.body ]
		br label %for.cond.cleanup

		for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
		%2 = phi i32 [ 0, %entry ], [ %1, %for.cond.cleanup.loopexit ]
		ret i32 %2

		; Extend foo4 so that any loop variants (%3 and %or) with mul/sub/add then extend will not
		; need a trunc instruction
		; CHECK: for.body:
		; CHECK-NOT: trunc
		; CHECK: [[TMP0:%.*]] = and i32 %length, %0
		; CHECK-NEXT: zext i32 [[TMP0]] to i64
		; CHECK: [[TMP1:%.*]] = or i32 %length, [[TMP2:%.*]]
		; CHECK-NEXT: zext i32 [[TMP1]] to i64
		for.body: ; preds = %for.body.lr.ph, %for.body
		%x.018 = phi i32 [ 1, %for.body.lr.ph ], [ %add, %for.body ]
		%add = add nuw nsw i32 %x.018, 1
		%3 = and i32 %length, %0
		%mul = mul nuw i32 %3, %add
		%idx.ext = zext i32 %mul to i64
		%add.ptr = getelementptr inbounds i32, i32* %in, i64 %idx.ext
		%4 = load i32, i32* %add.ptr, align 4
		%mul1 = mul nuw i32 %0, %add
		%idx.ext1 = zext i32 %mul1 to i64
		%add.ptr1 = getelementptr inbounds i32, i32* %in, i64 %idx.ext1
		%5 = load i32, i32* %add.ptr1, align 4
		%or = or i32 %length, %5
		%sub.or = sub nuw i32 %or, %add
		%or.ext = zext i32 %sub.or to i64
		%ptr.or = getelementptr inbounds i32, i32* %in, i64 %or.ext
		%val.or = load i32, i32* %ptr.or
		%6 = add i32 %4, %val.or
		%cmp = icmp ult i32 %add, %length
		br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
		}
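
For readers who prefer a source-level view, the loop in @foo6 corresponds roughly to the C++ below. This is a hand-written approximation, not the source the test was generated from: the Image layout is hypothetical (the test only shows that field 1 is loaded as an i32), the name foo6Sketch is made up, and plain unsigned arithmetic in C++ does not carry the nuw flags that the IR relies on.

```cpp
#include <cstdint>

// Hypothetical layout; only field 1 (%stride) is actually read by the test.
struct Image {
  int32_t channel;
  int32_t stride;
};

// Approximate source-level rendering of @foo6. Comments name the IR values
// each expression corresponds to.
uint32_t foo6Sketch(const Image *input, int32_t length, const uint32_t *in) {
  const uint32_t stride = static_cast<uint32_t>(input->stride);  // %0
  const uint32_t len = static_cast<uint32_t>(length);
  if (length <= 1)  // %cmp17 = icmp sgt i32 %length, 1
    return 0;

  uint32_t result = 0;
  uint32_t x = 1;  // %x.018
  uint32_t add = 0;
  do {
    add = x + 1;                                           // %add
    const uint32_t masked = len & stride;                  // %3
    const uint32_t a = in[uint64_t(masked * add)];         // %mul, %idx.ext, %4
    const uint32_t b = in[uint64_t(stride * add)];         // %mul1, %idx.ext1, %5
    const uint32_t ored = len | b;                         // %or, depends on a load
    const uint32_t c = in[uint64_t(ored - add)];           // %sub.or, %or.ext, %val.or
    result = a + c;                                        // %6
    x = add;
  } while (add < len);  // %cmp = icmp ult i32 %add, %length
  return result;
}
```

Each index is computed in 32 bits from the narrow induction variable and a second operand defined inside the loop (%3 and %or), and only then zero-extended to 64 bits. These are the uses the CHECK-NOT: trunc line expects to be widened without a trunc.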