This is an archive of the discontinued LLVM Phabricator instance.

Differential D60266

[LoopUnroll] Rotate loop, when optimizing for size and can fully unroll a loop.
AbandonedPublic

Authored by fhahn on Apr 4 2019, 7:08 AM.

Details

Reviewers

vsk
efriedma
dmgreen
paquette

Summary

If we can fully unroll a loop, there should be no size overhead when
rotating a loop, as a prerequisite for unrolling. Loop rotation is
conservative when optimizing for size, but we can rotate in loop-unroll,
if unrolling is beneficial.

To avoid rotating when LoopUnroll would subsequently fail, first check
that rotating will result in an unrollable loop.
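
As an illustration of the rotation discussed in the summary, here is a minimal C++ sketch. It is not taken from the patch: the function names and the `value` helper are hypothetical, and the trip count of 4 is chosen only for the example. It shows the same loop in top-tested form (exit check in the header, unconditional latch), in rotated form (a guard plus a bottom-tested loop), and fully unrolled. With a constant trip count the guard in the rotated form is trivially true, which is why rotation should add no size once the loop is fully unrolled.

```cpp
#include <array>
#include <cstdint>

// Hypothetical per-iteration value, used only for this illustration.
constexpr std::uint32_t value(std::uint32_t i) {
  const std::uint32_t s = i << 3;
  return (16u << s) | s;
}

// Top-tested form: the header holds the exit check and the latch
// branches back unconditionally.
std::array<std::uint32_t, 4> fillTopTested() {
  std::array<std::uint32_t, 4> a{};
  std::uint32_t i = 0;
  while (true) {
    if (i == 4) // header: conditional exit
      break;
    a[i] = value(i);
    ++i; // latch: unconditional back edge
  }
  return a;
}

// Rotated form: the header check is cloned into a guard in front of the
// loop, and the latch now holds the conditional exit.
std::array<std::uint32_t, 4> fillRotated() {
  std::array<std::uint32_t, 4> a{};
  std::uint32_t i = 0;
  if (i != 4) { // guard: always true here, so it can fold away
    do {
      a[i] = value(i);
      ++i;
    } while (i != 4); // latch: conditional exit
  }
  return a;
}

// Fully unrolled form: no header, guard, or latch remains.
std::array<std::uint32_t, 4> fillUnrolled() {
  return {value(0), value(1), value(2), value(3)};
}
```

All three functions produce the same array; only the control-flow shape differs.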

I measured code size on ARM64 with -Oz -flto on MultiSource, SPEC2000 and
SPEC2006. Code size changes are as follows:

test-suite...Rodinia/backprop/backprop.test      1796.00      1752.00   -2.4%
test-suite.../Benchmarks/Olden/mst/mst.test      1436.00      1420.00   -1.1%
test-suite...006/447.dealII/447.dealII.test    256408.00    255040.00   -0.5%
test-suite...urce/Applications/lua/lua.test     88648.00     88508.00   -0.2%
test-suite.../CINT2000/176.gcc/176.gcc.test    837764.00    836520.00   -0.1%
test-suite...yApps-C++/PENNANT/PENNANT.test     40156.00     40188.00    0.1%
test-suite...rks/tramp3d-v4/tramp3d-v4.test    195420.00    195280.00   -0.1%
test-suite...0/253.perlbmk/253.perlbmk.test    335568.00    335400.00   -0.1%
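
For readers checking the table, the percentage column can be reproduced with the small sketch below. It assumes the first numeric column is the baseline size and the second is the size with the patch applied, which is consistent with the listed percentages; the function name `percentChange` is hypothetical.

```cpp
#include <cmath>

// Relative size change in percent, rounded to one decimal place.
// Assumption: `before` is the baseline size, `after` the size with the patch.
double percentChange(double before, double after) {
  return std::round((after - before) / before * 1000.0) / 10.0;
}
```

For example, backprop goes from 1796 to 1752 bytes, a difference of -44, and -44 / 1796 is about -2.4%. PENNANT is the one listed regression, growing by 32 bytes (about 0.1%).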

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 30053
Build 30052: arc lint + arc unit

Event Timeline

fhahn created this revision. Apr 4 2019, 7:08 AM

Herald added a project: Restricted Project. Apr 4 2019, 7:08 AM

Herald added subscribers: dexonsmith, zzheng, hiraditya and 3 others.

fhahn added a parent revision: D60265: [LoopUnroll] Allow unrolling if the unrolled size does not exceed loop size. Apr 4 2019, 7:09 AM

Harbormaster completed remote builds in B30053: Diff 193704. Apr 4 2019, 7:12 AM

paquette mentioned this in D60265: [LoopUnroll] Allow unrolling if the unrolled size does not exceed loop size. Apr 4 2019, 9:11 AM

paquette added inline comments. Apr 4 2019, 9:24 AM

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1096	Can you just use the OptForSize you defined earlier in the function?
1108	I feel like this comment doesn't really describe why you're avoiding loops that are unsafe to clone? Maybe a bit more detail here?
1119	This could use a comment explaining what you're doing here?
1130	I think it's probably worth factoring `L->getHeader()->getParent()` out into a variable at this point, if only to reduce the number of `->`s. :)

fhahn marked an inline comment as done. Apr 4 2019, 9:57 AM

fhahn added inline comments.

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1108	Having all of those checks here is a bit unfortunate, really, because they are basically taken as is from UnrollLoop, and the comments make more sense there IMO. I'll see if I can push the rotation logic into LoopUnroll, as an option.

Why do we need to rotate the loop before unrolling? llvm::UnrollLoop currently refuses to unroll loops where the latch is an unconditional branch, but that isn't a fundamental limitation, as far as I can tell. We already support unrolling loops where the latch is not the exit branch; allowing loops where the latch doesn't exit at all is a minor extension. Granted, it might be more efficient to explicitly rotate the loop before unrolling, so we don't clone quite so much code.

On a related note, independent of what we do in unrolling, it would probably be worthwhile to teach loop rotation to rotate loops where the cloned header would fold to zero instructions.
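
To make the loop shape under discussion concrete, the following is a rough sketch of the kind of check involved, written against a toy CFG model rather than LLVM's classes. `Block`, `ToyLoop`, and `isTopTestedShape` are hypothetical names introduced for this example only. The predicate is true when the latch branches unconditionally back to the header and the header is a two-way branch with one successor being the latch and the other leaving the loop.

```cpp
#include <set>
#include <vector>

// Toy stand-ins for illustration only; these are not LLVM classes.
struct Block {
  std::vector<const Block *> succs; // successors of the block's terminator
};

struct ToyLoop {
  const Block *header;
  const Block *latch;
  std::set<const Block *> blocks;
  bool contains(const Block *b) const { return blocks.count(b) != 0; }
};

// True if the latch is an unconditional branch to the header and the
// header conditionally either continues to the latch or exits the loop.
bool isTopTestedShape(const ToyLoop &loop) {
  const Block *h = loop.header;
  const Block *l = loop.latch;
  if (l->succs.size() != 1 || l->succs[0] != h)
    return false;
  if (h->succs.size() != 2)
    return false;
  auto check = [&](unsigned s1, unsigned s2) {
    return h->succs[s1] == l && !loop.contains(h->succs[s2]);
  };
  return check(0, 1) || check(1, 0);
}

// A top-tested loop: header -> {exit, latch}, latch -> header.
bool demoTopTested() {
  Block header, latch, exitBlock;
  header.succs = {&exitBlock, &latch};
  latch.succs = {&header};
  ToyLoop loop{&header, &latch, {&header, &latch}};
  return isTopTestedShape(loop);
}

// A bottom-tested (already rotated) single-block loop: body -> {body, exit}.
bool demoBottomTested() {
  Block body, exitBlock;
  body.succs = {&body, &exitBlock};
  ToyLoop loop{&body, &body, {&body}};
  return isTopTestedShape(loop);
}
```

In this model `demoTopTested()` matches the shape and `demoBottomTested()` does not, since its latch already carries the conditional exit.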

In D60266#1455299, @efriedma wrote:

Why do we need to rotate the loop before unrolling? llvm::UnrollLoop currently refuses to unroll loops where the latch is an unconditional branch, but that isn't a fundamental limitation, as far as I can tell. We already support unrolling loops where the latch is not the exit branch; allowing loops where the latch doesn't exit at all is a minor extension. Granted, it might be more efficient to explicitly rotate the loop before unrolling, so we don't clone quite so much code.

IIUC the only reason requiring the conditional branch in the latch is to reduce the complexity of the unrolling code. In the non-optsize case, loop-rotate should take care of them. I can have a look and see how much work it would be to lift the restriction on UnrollLoop. The initial patch was just the easiest way to get things working, to give a good idea of the potential gains.

On a related note, independent of what we do in unrolling, it would probably be worthwhile to teach loop rotation to rotate loops where the cloned header would fold to zero instructions.

Great idea, it seems there are a few potential improvements for optsize around here. I'll add it to my todo list :)

Like D59832, #pragma clang loop unroll/#pragma unroll might require a loop rotation as well to successfully unroll?

In D60266#1455299, @efriedma wrote:

On a related note, independent of what we do in unrolling, it would probably be worthwhile to teach loop rotation to rotate loops where the cloned header would fold to zero instructions.

I started looking into this and put up D61683, which uses UnrolledInstAnalyzer to simulate the simplification of the hoisted header, including a patch to move some functionality in UnrolledInstAnalyzer. It is still rough (the interface and naming need improving) and it led to some size improvements and regressions. I need to take a closer look at the regressions; I suspect they highlight some problems in other passes that previously did not have to deal with a lot of loops with -Oz.

In D60266#1455299, @efriedma wrote:

Why do we need to rotate the loop before unrolling? llvm::UnrollLoop currently refuses to unroll loops where the latch is an unconditional branch, but that isn't a fundamental limitation, as far as I can tell. We already support unrolling loops where the latch is not the exit branch; allowing loops where the latch doesn't exit at all is a minor extension. Granted, it might be more efficient to explicitly rotate the loop before unrolling, so we don't clone quite so much code.

It looks like there is not too much work needed to support loops with unconditional latches with exiting headers. I've put up a patch: D61962. Still needs a bit more evaluating of the impact on code-size/performance. It is more aggressive than this patch.

The motivating case for this patch has been addressed by teaching UnrollLoop to deal with loops with unconditional latches: https://reviews.llvm.org/rL364398

Thanks for all the feedback!

Revision Contents

Path                                                         Size
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp                42 lines
llvm/test/Transforms/LoopUnroll/AArch64/unroll-optsize.ll    93 lines

Diff 193704

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp

[47 lines not shown]
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/Scalar.h"
 #include "llvm/Transforms/Scalar/LoopPassManager.h"
 #include "llvm/Transforms/Utils.h"
+#include "llvm/Transforms/Utils/LoopRotationUtils.h"
 #include "llvm/Transforms/Utils/LoopSimplify.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
 #include "llvm/Transforms/Utils/UnrollLoop.h"
 #include <algorithm>
 #include <cassert>
 #include <cstdint>
 #include <limits>
 #include <string>
[1,019 lines not shown]
   if (!UP.Count)
     return LoopUnrollResult::Unmodified;
   // Unroll factor (Count) must be less or equal to TripCount.
   if (TripCount && UP.Count > TripCount)
     UP.Count = TripCount;
 
   // Save loop properties before it is transformed.
   MDNode *OrigLoopID = L->getLoopID();
 
+  // Check if we should rotate the loop, when optimizing for size and we can
+  // fully unroll the loop. We do some additional checks to make sure the loop
+  // will be unroll-able after rotating, to avoid rotating without unrolling.
+  auto shouldRotate = [&](Loop *L) {
+    if (!L->getHeader()->getParent()->optForSize() || !TripCount ||
+        UP.Count != TripCount)
+      return false;
+
+    BasicBlock *Preheader = L->getLoopPreheader();
+    if (!Preheader)
+      return false;
+
+    BasicBlock *LatchBlock = L->getLoopLatch();
+    if (!LatchBlock)
+      return false;
+
+    // Loops with indirectbr cannot be cloned.
+    if (!L->isSafeToClone())
+      return false;
+
+    BasicBlock *Header = L->getHeader();
+    if (Header->hasAddressTaken())
+      return false;
+
+    BranchInst *BI = dyn_cast<BranchInst>(LatchBlock->getTerminator());
+    if (!BI || BI->isUnconditional()) {
+      BranchInst *HeaderBI = dyn_cast<BranchInst>(Header->getTerminator());
+      auto CheckSuccessors = [&](unsigned S1, unsigned S2) {
+        return HeaderBI->getSuccessor(S1) == LatchBlock &&
+               !L->contains(HeaderBI->getSuccessor(S2));
+      };
+      return HeaderBI->isConditional() &&
+             (CheckSuccessors(0, 1) || CheckSuccessors(1, 0));
+    }
+    return false;
+  };
+
+  if (shouldRotate(L)) {
+    SimplifyQuery SQ(L->getHeader()->getParent()->getParent()->getDataLayout());
+    LoopRotation(L, LI, &TTI, &AC, &DT, &SE, nullptr, SQ, true, -1, true);
+  }
   // Unroll the loop.
   Loop *RemainderLoop = nullptr;
   LoopUnrollResult UnrollResult = UnrollLoop(
       L, UP.Count, TripCount, UP.Force, UP.Runtime, UP.AllowExpensiveTripCount,
       UseUpperBound, MaxOrZero, TripMultiple, UP.PeelCount, UP.UnrollRemainder,
       LI, &SE, &DT, &AC, &ORE, PreserveLCSSA, &RemainderLoop);
   if (UnrollResult == LoopUnrollResult::Unmodified)
     return LoopUnrollResult::Unmodified;
[324 lines not shown]

llvm/test/Transforms/LoopUnroll/AArch64/unroll-optsize.ll

[118 lines not shown]
 for.body: ; preds = %for.body, %entry
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 
 for.cond.cleanup: ; preds = %for.cond
   %ptr = bitcast [4 x i32]* %arr to i32*
   call void @use(i32* nonnull %ptr) #4
   ret void
 }
 
+; We need to rotate the loop in order to unroll it.
+define void @fully_unrolled_smaller_rotated() #0 {
+; CHECK-LABEL: @fully_unrolled_smaller_rotated(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[ARR:%.*]] = alloca [4 x i32], align 4
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [4 x i32], [4 x i32]* [[ARR]], i64 0, i64 0
+; CHECK-NEXT: store i32 16, i32* [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds [4 x i32], [4 x i32]* [[ARR]], i64 0, i64 1
+; CHECK-NEXT: store i32 4104, i32* [[ARRAYIDX_1]], align 4
+; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds [4 x i32], [4 x i32]* [[ARR]], i64 0, i64 2
+; CHECK-NEXT: store i32 1048592, i32* [[ARRAYIDX_2]], align 4
+; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds [4 x i32], [4 x i32]* [[ARR]], i64 0, i64 3
+; CHECK-NEXT: store i32 268435480, i32* [[ARRAYIDX_3]], align 4
+; CHECK-NEXT: [[PTR:%.*]] = bitcast [4 x i32]* [[ARR]] to i32*
+; CHECK-NEXT: call void @use(i32* nonnull [[PTR]])
+; CHECK-NEXT: ret void
+;
+entry:
+  %arr = alloca [4 x i32], align 4
+  br label %for.cond
+
+for.cond:
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+  %exitcond = icmp eq i64 %indvars.iv, 3
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body
+
+for.body: ; preds = %for.body, %entry
+  %indvars.iv.tr = trunc i64 %indvars.iv to i32
+  %shl.0 = shl i32 %indvars.iv.tr, 3
+  %shl.1 = shl i32 16, %shl.0
+  %or = or i32 %shl.1, %shl.0
+  %arrayidx = getelementptr inbounds [4 x i32], [4 x i32]* %arr, i64 0, i64 %indvars.iv
+  store i32 %or, i32* %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  br label %for.cond
+
+for.cond.cleanup: ; preds = %for.cond
+  %ptr = bitcast [4 x i32]* %arr to i32*
+  call void @use(i32* nonnull %ptr) #4
+  ret void
+}
+
+define void @fully_unrolled_bigger_not_rotated() #0 {
+; CHECK-LABEL: @fully_unrolled_bigger_not_rotated(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[ARR:%.*]] = alloca [4 x i32], align 4
+; CHECK-NEXT: br label [[FOR_COND:%.*]]
+; CHECK: for.cond:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY:%.*]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV]], 6
+; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV_TR:%.*]] = trunc i64 [[INDVARS_IV]] to i32
+; CHECK-NEXT: [[SHL_0:%.*]] = shl i32 [[INDVARS_IV_TR]], 3
+; CHECK-NEXT: [[SHL_1:%.*]] = shl i32 16, [[SHL_0]]
+; CHECK-NEXT: [[OR:%.*]] = or i32 [[SHL_1]], [[SHL_0]]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [4 x i32], [4 x i32]* [[ARR]], i64 0, i64 [[INDVARS_IV]]
+; CHECK-NEXT: store i32 [[OR]], i32* [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: br label [[FOR_COND]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: [[PTR:%.*]] = bitcast [4 x i32]* [[ARR]] to i32*
+; CHECK-NEXT: call void @use(i32* nonnull [[PTR]])
+; CHECK-NEXT: ret void
+;
+entry:
+  %arr = alloca [4 x i32], align 4
+  br label %for.cond
+
+for.cond:
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
+  %exitcond = icmp eq i64 %indvars.iv, 6
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body
+
+for.body: ; preds = %for.body, %entry
+  %indvars.iv.tr = trunc i64 %indvars.iv to i32
+  %shl.0 = shl i32 %indvars.iv.tr, 3
+  %shl.1 = shl i32 16, %shl.0
+  %or = or i32 %shl.1, %shl.0
+  %arrayidx = getelementptr inbounds [4 x i32], [4 x i32]* %arr, i64 0, i64 %indvars.iv
+  store i32 %or, i32* %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  br label %for.cond
+
+for.cond.cleanup: ; preds = %for.cond
+  %ptr = bitcast [4 x i32]* %arr to i32*
+  call void @use(i32* nonnull %ptr) #4
+  ret void
+}
+
 
 declare void @use(i32*)
 
 attributes #0 = { minsize optsize }