This is an archive of the discontinued LLVM Phabricator instance.

Differential D45374

[LoopUnroll] Limit peeling to conds in BBs executed on every iteration.
Abandoned · Public

Authored by fhahn on Apr 6 2018, 8:36 AM.

Details

Reviewers

mcrosier
efriedma
mkazantsev

Summary

Peeling off iterations that only simplify a nested conditional is likely
to increase the code size without being likely to be beneficial.

Event Timeline

fhahn created this revision. Apr 6 2018, 8:36 AM

fhahn mentioned this in D43876: [LoopUnroll] Peel off iterations if it makes conditions true/false. Apr 6 2018, 8:41 AM

junbuml added a subscriber: junbuml. Apr 6 2018, 9:14 AM

You get some of the benefit of peeling whether or not the condition dominates the latch: both the peeled iterations and the remaining loop become simpler because the check disappears, and presumably that codepath is executed at least some of the time. But maybe that doesn't help enough to be worth doing? I'd like to understand the motivation a bit more.
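To make the point about nested conditions concrete, here is a minimal hand-written sketch of peeling. It is not taken from the patch: the functions `sumOriginal`, `sumPeeled` and `checkEquivalent` and the loop body are assumptions chosen for illustration. The comparison `i < 2` sits under a data-dependent branch, so its block does not run on every iteration, yet peeling two iterations still removes the comparison from both the peeled copies and the remaining loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original loop: `i < 2` is nested under `v[i] > 0`, so the block holding
// the comparison is not executed on every iteration.
int sumOriginal(const std::vector<int> &v) {
  int sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    if (v[i] > 0) {
      if (i < 2)
        sum += 10;
      else
        sum += v[i];
    }
  }
  return sum;
}

// Same computation with two iterations peeled by hand. In the peeled copies
// `i < 2` is known to be true; in the remaining loop it is known to be false.
int sumPeeled(const std::vector<int> &v) {
  int sum = 0;
  const std::size_t n = v.size();
  if (n > 0 && v[0] > 0) // peeled iteration i = 0
    sum += 10;
  if (n > 1 && v[1] > 0) // peeled iteration i = 1
    sum += 10;
  for (std::size_t i = 2; i < n; ++i) // remaining loop, no `i < 2` check
    if (v[i] > 0)
      sum += v[i];
  return sum;
}

// Helper that checks both versions agree on a given input.
void checkEquivalent(const std::vector<int> &v) {
  assert(sumOriginal(v) == sumPeeled(v));
}
```

The cost is visible as well: the body guarded by `v[i] > 0` now appears three times, which is the code-size growth the summary is concerned about.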

Herald added a subscriber: zzheng. Apr 6 2018, 12:46 PM

I also don't get it. Peeling can help to deal with a comparison like this:

b = -1;
for (i = 0; i < N; i++) {
  if (something) {
    if (b > 0) break;
  }
  b = i;
}

Here b > 0 does not dominate the latch, but it is still profitable to peel away 1 iteration to get rid of it.
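The dominance relation mentioned here can be checked on a small model of the loop's control-flow graph. The sketch below is an illustration only, not LLVM code: the block names, the rotated-loop shape (exit test in the latch) and the helpers `exampleLoopPreds`, `computeDominators` and `dominates` are all assumptions made for the example. It uses the textbook iterative dominator computation over bitmasks rather than LLVM's `DominatorTree`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical blocks for the loop above, in rotated form:
//   Body:  if (something) goto Inner; else goto Latch;
//   Inner: if (b > 0) goto Exit;      else goto Latch;
//   Latch: b = i; i++; if (i < N) goto Body; else goto Exit;
enum Block : std::size_t { Entry, Body, Inner, Latch, Exit, NumBlocks };

using Mask = std::uint32_t; // bit k set means "block k dominates this block"

// Predecessor lists for the example CFG.
std::vector<std::vector<std::size_t>> exampleLoopPreds() {
  std::vector<std::vector<std::size_t>> preds(NumBlocks);
  preds[Body] = {Entry, Latch};
  preds[Inner] = {Body};
  preds[Latch] = {Body, Inner};
  preds[Exit] = {Inner, Latch};
  return preds;
}

// Iterative data-flow solution: dom(b) = {b} united with the intersection of
// dom(p) over all predecessors p of b, repeated until nothing changes.
std::vector<Mask>
computeDominators(const std::vector<std::vector<std::size_t>> &preds,
                  std::size_t entry) {
  const std::size_t n = preds.size();
  assert(n > 0 && n < 32 && entry < n);
  const Mask all = (Mask{1} << n) - 1;
  std::vector<Mask> dom(n, all);
  dom[entry] = Mask{1} << entry;
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t b = 0; b < n; ++b) {
      if (b == entry)
        continue;
      Mask meet = all;
      for (std::size_t p : preds[b])
        meet &= dom[p];
      const Mask next = meet | (Mask{1} << b);
      if (next != dom[b]) {
        dom[b] = next;
        changed = true;
      }
    }
  }
  return dom;
}

// True if block a dominates block b.
bool dominates(const std::vector<Mask> &dom, std::size_t a, std::size_t b) {
  return ((dom[b] >> a) & 1u) != 0;
}
```

In this model `Body` (the `something` test) dominates `Latch`, while `Inner` (the `b > 0` test) does not, so a filter of the form `!DT.dominates(BB, LoopLatch)` would skip the `b > 0` comparison.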

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals is likely to have less positive impact than simplifying top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

javed.absar added a subscriber: javed.absar. Apr 9 2018, 3:26 AM

Prior to loop peeling, the function we'd like to have inlined in h264ref has a single use.

Currently, (non-simple) loop peeling will not peel a loop if it includes a function call that is likely to be inlined (i.e., is not marked with a noinline attribute, has internal linkage and has a single use). This is exactly the case we're dealing with in h264ref, except the function to be inlined isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase.
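The three conditions listed above can be written down as a small predicate. This is a sketch under stated assumptions, not the code LLVM uses: `CalleeInfo`, its fields, `likelyToBeInlined`, `peelingBlockedByCalls` and `h264refScenario` are hypothetical names introduced only to show why the check behaves differently before and during LTO.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of a callee; field names are illustrative only.
struct CalleeInfo {
  bool hasNoInlineAttr;    // callee is marked noinline
  bool hasInternalLinkage; // callee is internal to the module
  unsigned numUses;        // number of uses of the callee
};

// Mirrors the conditions from the comment: not noinline, internal linkage,
// and exactly one use.
bool likelyToBeInlined(const CalleeInfo &c) {
  return !c.hasNoInlineAttr && c.hasInternalLinkage && c.numUses == 1;
}

// Peeling is skipped if any call in the loop targets such a callee.
bool peelingBlockedByCalls(const std::vector<CalleeInfo> &calleesInLoop) {
  return std::any_of(calleesInLoop.begin(), calleesInLoop.end(),
                     likelyToBeInlined);
}

// The situation described for h264ref: the same single-use callee only
// becomes internal once LTO runs.
void h264refScenario() {
  const CalleeInfo beforeLTO{false, false, 1};
  const CalleeInfo duringLTO{false, true, 1};
  assert(!peelingBlockedByCalls({beforeLTO})); // peeling may happen too early
  assert(peelingBlockedByCalls({duringLTO}));  // peeling would be skipped
}
```

Under this model the pre-LTO pipeline sees no reason to hold back, which is why deferring peeling to the LTO phase was floated as one option.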

Quick update on this, Florian. I think my initial analysis was a little off, or rather the regression in h264ref was actually just noise. I tried reverting r327671 (your change to peeling) on ToT this afternoon to verify the regression, and now the revert itself is causing a regression. Further, after r324557 (which changed the codegen optimization level when compiling with gold) was reverted in r329458, I'm now showing that h264ref is ahead by 3.19% when comparing ToT to Clang 6.0. Thus, I'm not sure this change is worth pursuing if the sole purpose is to address the h264ref regression (which doesn't appear to be a real regression after all).

In D45374#1063478, @mcrosier wrote:

Prior to loop peeling, the function we'd like to have inlined in h264ref has a single use.

Currently, (non-simple) loop peeling will not peel a loop if it includes a function call that is likely to be inlined (i.e., is not marked with a noinline attribute, has internal linkage and has a single use). This is exactly the case we're dealing with in h264ref, except the function to be inlined isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase.

Ignore the above comment. I wrote this earlier this morning, prior to identifying the regression was more likely just noise.

Thanks for the update Chad. There is no need for this change then I think.

In D45374#1064283, @fhahn wrote:

Thanks for the update Chad. There is no need for this change then I think.

I tend to agree. Thanks, Florian.

Revision Contents

Path                                                  Size
include/llvm/Transforms/Utils/UnrollLoop.h            3 lines
lib/Transforms/Scalar/LoopUnrollPass.cpp              2 lines
lib/Transforms/Utils/LoopUnrollPeel.cpp               17 lines
test/Transforms/LoopUnroll/peel-loop-conditions.ll    46 lines

Diff 141354

include/llvm/Transforms/Utils/UnrollLoop.h

[65 lines not shown]
@@ bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,
                                 bool UseEpilogRemainder, bool UnrollRemainder,
                                 LoopInfo *LI,
                                 ScalarEvolution *SE, DominatorTree *DT,
                                 AssumptionCache *AC,
                                 bool PreserveLCSSA);

 void computePeelCount(Loop *L, unsigned LoopSize,
                       TargetTransformInfo::UnrollingPreferences &UP,
-                      unsigned &TripCount, ScalarEvolution &SE);
+                      unsigned &TripCount, ScalarEvolution &SE,
+                      DominatorTree &DT);

 bool canPeel(Loop *L);

 bool peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI, ScalarEvolution *SE,
               DominatorTree *DT, AssumptionCache *AC, bool PreserveLCSSA);

 MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);

 } // end namespace llvm

 #endif // LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H

lib/Transforms/Scalar/LoopUnrollPass.cpp

[790 lines not shown]
@@ if (getUnrolledLoopSize(LoopSize, UP) < UP.Threshold) {
           TripMultiple = UP.UpperBound ? 1 : TripMultiple;
           return ExplicitUnroll;
         }
       }
     }
   }

   // 4th priority is loop peeling
-  computePeelCount(L, LoopSize, UP, TripCount, SE);
+  computePeelCount(L, LoopSize, UP, TripCount, SE, DT);
   if (UP.PeelCount) {
     UP.Runtime = false;
     UP.Count = 1;
     return ExplicitUnroll;
   }

   // 5th priority is partial unrolling.
   // Try partial unroll only when TripCount could be staticaly calculated.
[553 lines not shown]

lib/Transforms/Utils/LoopUnrollPeel.cpp

[142 lines not shown]
 // body true/false. For example, if we peel 2 iterations off the loop below,
 // the condition i < 2 can be evaluated at compile time.
 //  for (i = 0; i < n; i++)
 //    if (i < 2)
 //      ..
 //    else
 //      ..
 //  }
+// It only considers conditions in blocks that are executed on every iteration.
 static unsigned countToEliminateCompares(Loop &L, unsigned MaxPeelCount,
-                                         ScalarEvolution &SE) {
+                                         ScalarEvolution &SE,
+                                         DominatorTree &DT) {
   assert(L.isLoopSimplifyForm() && "Loop needs to be in loop simplify form");
   unsigned DesiredPeelCount = 0;

+  BasicBlock *LoopLatch = L.getLoopLatch();
   for (auto *BB : L.blocks()) {
     auto *BI = dyn_cast<BranchInst>(BB->getTerminator());
     if (!BI || BI->isUnconditional())
       continue;

-    // Ignore loop exit condition.
-    if (L.getLoopLatch() == BB)
+    // Ignore loop exit condition and blocks that are not executed on every
+    // iteration.
+    if (LoopLatch == BB || !DT.dominates(BB, LoopLatch))
       continue;

     Value *Condition = BI->getCondition();
     Value *LeftVal, *RightVal;
     CmpInst::Predicate Pred;
     if (!match(Condition, m_ICmp(Pred, m_Value(LeftVal), m_Value(RightVal))))
       continue;

[43 lines not shown]
@@ static unsigned countToEliminateCompares(Loop &L, unsigned MaxPeelCount,
   }

   return DesiredPeelCount;
 }

 // Return the number of iterations we want to peel off.
 void llvm::computePeelCount(Loop *L, unsigned LoopSize,
                             TargetTransformInfo::UnrollingPreferences &UP,
-                            unsigned &TripCount, ScalarEvolution &SE) {
+                            unsigned &TripCount, ScalarEvolution &SE,
+                            DominatorTree &DT) {
   assert(LoopSize > 0 && "Zero loop size is not allowed!");
   // Save the UP.PeelCount value set by the target in
   // TTI.getUnrollingPreferences or by the flag -unroll-peel-count.
   unsigned TargetPeelCount = UP.PeelCount;
   UP.PeelCount = 0;
   if (!canPeel(L))
     return;

[37 lines not shown]
@@ for (auto BI = L->getHeader()->begin(); isa<PHINode>(&*BI); ++BI) {
       if (ToInvariance != InfiniteIterationsToInvariance)
         DesiredPeelCount = std::max(DesiredPeelCount, ToInvariance);
     }

     // Pay respect to limitations implied by loop size and the max peel count.
     unsigned MaxPeelCount = UnrollPeelMaxCount;
     MaxPeelCount = std::min(MaxPeelCount, UP.Threshold / LoopSize - 1);

-    DesiredPeelCount = std::max(DesiredPeelCount,
-                                countToEliminateCompares(*L, MaxPeelCount, SE));
+    DesiredPeelCount = std::max(
+        DesiredPeelCount, countToEliminateCompares(*L, MaxPeelCount, SE, DT));

     if (DesiredPeelCount > 0) {
       DesiredPeelCount = std::min(DesiredPeelCount, MaxPeelCount);
       // Consider max peel count limitation.
       assert(DesiredPeelCount > 0 && "Wrong loop size estimation?");
       DEBUG(dbgs() << "Peel " << DesiredPeelCount << " iteration(s) to turn"
                    << " some Phis into invariants.\n");
       UP.PeelCount = DesiredPeelCount;
[370 lines not shown]

test/Transforms/LoopUnroll/peel-loop-conditions.ll

[600 lines not shown]
@@ for.inc:
   %inc = add nsw i32 %i.05, 2
   %j.inc = add nsw i32 %j, 1
   %cmp = icmp slt i32 %inc, %k
   br i1 %cmp, label %for.body, label %for.end

 for.end:
   ret void
 }

+define void @test11(i32 %k) {
+; CHECK-LABEL: @test11(
+; CHECK-NEXT:  for.body.lr.ph:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[I_05:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i32 [[I_05]], [[K:%.*]]
+; CHECK-NEXT:    br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[FOR_INC]]
+; CHECK:       if.then:
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp ugt i32 [[I_05]], 2
+; CHECK-NEXT:    br i1 [[CMP2]], label [[INNER_THEN:%.*]], label [[FOR_INC]]
+; CHECK:       inner.then:
+; CHECK-NEXT:    call void @f1()
+; CHECK-NEXT:    br label [[FOR_INC]]
+; CHECK:       for.inc:
+; CHECK-NEXT:    [[INC]] = add nsw i32 [[I_05]], 1
+; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[INC]], [[K]]
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END:%.*]]
+; CHECK:       for.end:
+; CHECK-NEXT:    ret void
+;
+for.body.lr.ph:
+  br label %for.body
+
+for.body:
+  %i.05 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
+  %cmp1 = icmp ult i32 %i.05, %k
+  br i1 %cmp1, label %if.then, label %for.inc
+
+if.then:
+  %cmp2 = icmp ugt i32 %i.05, 2
+  br i1 %cmp2, label %inner.then, label %for.inc
+
+inner.then:
+  call void @f1()
+  br label %for.inc
+
+for.inc:
+  %inc = add nsw i32 %i.05, 1
+  %cmp = icmp slt i32 %inc, %k
+  br i1 %cmp, label %for.body, label %for.end
+
+for.end:
+  ret void
+}
