This is an archive of the discontinued LLVM Phabricator instance.

Differential D50075

[UnJ] Improve explicit loop count checks
Closed, Public

Authored by dmgreen on Jul 31 2018, 8:54 AM.

Details

Reviewers

SjoerdMeijer
hfinkel
ddibyend
Meinersbur

Commits

rGf7111d1ecef1: [UnJ] Improve explicit loop count checks
rL339501: [UnJ] Improve explicit loop count checks

Summary

Try to improve the computed count when it has been explicitly set by a pragma or command-line option. This moves the code around so that we first call computeUnrollCount to get a sensible count, and then override that count if explicit unroll-and-jam counts are specified.

Also added some extra debug messages for when unroll and jamming is disabled.
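
The reordering described in the summary can be illustrated with a minimal sketch. It is not the code from this patch: `Prefs`, `heuristicCount` and `chooseCount` are hypothetical names, the size model (`LoopSize * Count`) is an assumption, and the size and threshold checks the real pass applies to explicit counts are omitted. It only shows the ordering: compute a heuristic count first, then let an explicit count (0 meaning "not set") override it.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the unrolling preferences.
// The real TargetTransformInfo::UnrollingPreferences has many more fields.
struct Prefs {
  unsigned Count = 0;
  bool Force = false;
};

// Hypothetical heuristic: start at MaxCount and reduce the count until the
// (assumed) unrolled size LoopSize * Count drops below Threshold.
// The result never goes below 1.
unsigned heuristicCount(unsigned LoopSize, unsigned Threshold,
                        unsigned MaxCount) {
  assert(LoopSize > 0 && MaxCount > 0 && "sizes and counts must be non-zero");
  unsigned Count = MaxCount;
  while (Count > 1 && LoopSize * Count >= Threshold)
    --Count;
  return Count;
}

// Ordering from the summary: heuristic first, explicit override second.
// ExplicitCount == 0 means no pragma or command-line count was given.
// Returns true when the final count was explicitly requested.
bool chooseCount(unsigned LoopSize, unsigned Threshold, unsigned MaxCount,
                 unsigned ExplicitCount, Prefs &UP) {
  UP.Count = heuristicCount(LoopSize, Threshold, MaxCount);
  if (ExplicitCount > 0) {
    UP.Count = ExplicitCount;
    UP.Force = true;
    return true;
  }
  return false;
}
```

With a loop size of 10, a threshold of 45 and a maximum count of 8, the heuristic in this sketch settles on 4; passing an explicit count of 6 replaces that value and sets `Force`.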

Diff Detail

Repository: rL LLVM

Event Timeline

dmgreen created this revision. Jul 31 2018, 8:54 AM

Herald added a subscriber: zzheng. Jul 31 2018, 8:54 AM

I find the summary confusing. Can we say that simple unrolling will be prioritized over llvm.loop.unroll_and_jam.count metadata and -unroll-and-jam-count= cmdline options?

The test case does not seem to test this. The comment mentions it expects unroll-and-jam to happen, instead of prioritizing llvm.loop.unroll.enable.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
160 ↗	(On Diff #158291)	`OuterTripCount` is passed by-reference to `computeUnrollCount` but passed by-value to `computeUnrollAndJamCount`. It is used here and by the caller of `computeUnrollAndJamCount` (but which will use the non-updated value because of passing by-value). Is this intentional?
198 ↗	(On Diff #158291)	`ExplicitUnroll` seems to have two meanings: first to determine whether simple unroll is enabled, second to determine whether Unroll-And-Jam is enabled. Could we rename this to `ExplicitUnrollAndJam`?
213–216 ↗	(On Diff #158291)	What is supposed to happen with `llvm.loop.unroll_and_jam.enable` set, but not `llvm.loop.unroll_and_jam.count`? My expectation would be that the count is determined automatically, including considering `UP.UnrollAndJamInnerLoopThreshold` which would be skipped here.
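
The by-reference versus by-value question raised for line 160 comes down to plain C++ parameter semantics, which the following sketch illustrates. The functions `computeInnerCount` and `computeOuterCount` are hypothetical and do not reproduce what `computeUnrollCount` or `computeUnrollAndJamCount` actually do; the update rule inside the callee is an assumption made only so that there is something to observe.

```cpp
#include <cassert>

// Hypothetical callee that takes the trip count by reference and may
// rewrite it (here: when it is unknown and an upper bound is available).
bool computeInnerCount(unsigned &TripCount, unsigned MaxTripCount) {
  if (TripCount == 0 && MaxTripCount != 0) {
    TripCount = MaxTripCount;
    return true;
  }
  return false;
}

// Hypothetical wrapper that takes the trip count by value. Any update made
// by the callee lands in the wrapper's local copy only.
unsigned computeOuterCount(unsigned TripCount, unsigned MaxTripCount) {
  computeInnerCount(TripCount, MaxTripCount);
  return TripCount; // the local copy, possibly updated by the callee
}
```

Calling `computeOuterCount(TC, 16)` with `TC == 0` returns 16 in this sketch while the caller's `TC` stays 0, which is the "non-updated value" situation the comment asks about. With a `MaxTripCount` of 0 the sketch never rewrites the value at all.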

Thanks for taking a look. The problem I was trying to solve here was the runtime thresholds preventing UnJ even if an explicit count was set.

There are other pragma tests in pragma.ll, which check combinations of unroll and unroll_and_jam pragmas. The behaviour when both unroll metadata and unroll_and_jam metadata are present isn't currently very refined. I would expect, at least in the default pipeline, for the unroll metadata to be handled first in one of the early unroll passes.

Let me know if you think I should change how it works here.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
160 ↗	(On Diff #158291)	I believe computeUnrollCount will only update TripCount if it is fully unrolling, and only update it to either TripCount (itself) or MaxTripCount. The second shouldn't be used as we are setting it to 0 here, so OuterTripCount shouldn't be updated. I've removed the OuterTripCount use below in this function; I think the increased threshold should apply for partial loops too.
213–216 ↗	(On Diff #158291)	Yep, good point. The pragma threshold is very high, but should still be checked. I've tried to split the ExplicitCounts from the other Explicits.
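
The split between "explicit count" and "explicit enable" discussed above can be sketched as follows. This is a simplified model under stated assumptions, not the pass itself: `Prefs`, `jammedSize` and `adjustForInnerLoop` are hypothetical names, the size model (`LoopSize * Count`) is assumed, and `PragmaThreshold` stands in for the raised pragma threshold. Any explicit request raises the inner-loop threshold, but only an explicit count exempts the count from being reduced to fit it.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the unrolling preferences.
struct Prefs {
  unsigned Count = 0;
  unsigned InnerThreshold = 60;
  bool AllowRemainder = true;
};

// Assumed size model: the jammed inner loop grows linearly with the count.
unsigned jammedSize(unsigned LoopSize, const Prefs &UP) {
  return LoopSize * UP.Count;
}

// PragmaCount == 0 means no explicit count; PragmaEnable models an
// "enable" request that carries no count.
void adjustForInnerLoop(unsigned InnerLoopSize, unsigned PragmaCount,
                        bool PragmaEnable, unsigned PragmaThreshold,
                        Prefs &UP) {
  bool ExplicitCount = PragmaCount > 0;
  bool Explicit = PragmaEnable || ExplicitCount;

  // Any explicit request gets the more generous threshold.
  if (Explicit)
    UP.InnerThreshold = PragmaThreshold;

  // An explicitly requested count is kept as-is; otherwise the count is
  // reduced until the inner loop fits under the threshold.
  if (!ExplicitCount && UP.AllowRemainder) {
    while (UP.Count != 0 &&
           jammedSize(InnerLoopSize, UP) >= UP.InnerThreshold)
      --UP.Count;
  }
}
```

With an inner loop size of 20 and a starting count of 8, this sketch reduces the count to 4 for an enable-only request against a pragma threshold of 100, leaves it at 8 when the count itself is explicit, and reduces it to 2 under the default threshold of 60.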

In D50075#1183066, @dmgreen wrote:

There are other pragma tests in pragma.ll, which check combinations of unroll and unroll_and_jam pragmas. The behaviour when both unroll metadata and unroll_and_jam metadata are present isn't currently very refined. I would expect, at least in the default pipeline, for the unroll metadata to be handled first in one of the early unroll passes.

This is what I am trying to tackle in D49281: Ideally one must not define multiple transformations at all, but if the frontend wants multiple transformations, it has to explicitly define an order.

Let me know if you think I should change how it works here.

Except the CHECK-NOT: LGTM

test/Transforms/LoopUnrollAndJam/pragma-explicit.ll
9 ↗	(On Diff #158346)	`-NOT` tests are quite fragile. To check that there is no 4th unroll, did you consider checking e.g. the `; preds =` of the BB that follows the last unroll copy?

This revision is now accepted and ready to land. Aug 9 2018, 2:27 PM

Thanks

test/Transforms/LoopUnrollAndJam/pragma-explicit.ll
9 ↗	(On Diff #158346)	The unrolled BBs get merged, usually leaving just the first block, so unfortunately I don't think the preds will show much. I've changed things to check the outer phi nodes, where it can check that the return phi comes from %.pre60.3, in this case.

Closed by commit rL339501: [UnJ] Improve explicit loop count checks (authored by dmgreen). Aug 11 2018, 12:38 AM

This revision was automatically updated to reflect the committed changes.

Meinersbur added inline comments. Aug 14 2018, 3:56 PM

test/Transforms/LoopUnrollAndJam/pragma-explicit.ll
9 ↗	(On Diff #158346)	Correct, as I had to find out myself in D50075. Sorry for the incorrect suggestion. I ended up checking for the number of stores (there is only one in the source IR), followed by a `CHECK-NOT: store`. This does not depend on any instruction naming. To avoid matching occurrences in the remainder, add a `CHECK: br i1` to match the end of the BasicBlock before the `CHECK-NOT`. For instance here: https://reviews.llvm.org/D49281#change-CH2qXAAQjJUs

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp	119 lines
llvm/trunk/test/Transforms/LoopUnrollAndJam/pragma-explicit.ll	144 lines

Diff 160232

llvm/trunk/lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp

	// unroll count was set explicitly.			// unroll count was set explicitly.
	static bool computeUnrollAndJamCount(			static bool computeUnrollAndJamCount(
	Loop *L, Loop *SubLoop, const TargetTransformInfo &TTI, DominatorTree &DT,			Loop *L, Loop *SubLoop, const TargetTransformInfo &TTI, DominatorTree &DT,
	LoopInfo *LI, ScalarEvolution &SE,			LoopInfo *LI, ScalarEvolution &SE,
	const SmallPtrSetImpl<const Value *> &EphValues,			const SmallPtrSetImpl<const Value *> &EphValues,
	OptimizationRemarkEmitter *ORE, unsigned OuterTripCount,			OptimizationRemarkEmitter *ORE, unsigned OuterTripCount,
	unsigned OuterTripMultiple, unsigned OuterLoopSize, unsigned InnerTripCount,			unsigned OuterTripMultiple, unsigned OuterLoopSize, unsigned InnerTripCount,
	unsigned InnerLoopSize, TargetTransformInfo::UnrollingPreferences &UP) {			unsigned InnerLoopSize, TargetTransformInfo::UnrollingPreferences &UP) {
	// Check for explicit Count from the "unroll-and-jam-count" option.			// First up use computeUnrollCount from the loop unroller to get a count
				// for unrolling the outer loop, plus any loops requiring explicit
				// unrolling we leave to the unroller. This uses UP.Threshold /
				// UP.PartialThreshold / UP.MaxCount to come up with sensible loop values.
				// We have already checked that the loop has no unroll.* pragmas.
				unsigned MaxTripCount = 0;
				bool UseUpperBound = false;
				bool ExplicitUnroll = computeUnrollCount(
				L, TTI, DT, LI, SE, EphValues, ORE, OuterTripCount, MaxTripCount,
				OuterTripMultiple, OuterLoopSize, UP, UseUpperBound);
				if (ExplicitUnroll || UseUpperBound) {
				// If the user explicitly set the loop as unrolled, dont UnJ it. Leave it
				// for the unroller instead.
				LLVM_DEBUG(dbgs() << "Won't unroll-and-jam; explicit count set by "
				"computeUnrollCount\n");
				UP.Count = 0;
				return false;
				}

				// Override with any explicit Count from the "unroll-and-jam-count" option.
	bool UserUnrollCount = UnrollAndJamCount.getNumOccurrences() > 0;			bool UserUnrollCount = UnrollAndJamCount.getNumOccurrences() > 0;
	if (UserUnrollCount) {			if (UserUnrollCount) {
	UP.Count = UnrollAndJamCount;			UP.Count = UnrollAndJamCount;
	UP.Force = true;			UP.Force = true;
	if (UP.AllowRemainder &&			if (UP.AllowRemainder &&
	getUnrollAndJammedLoopSize(OuterLoopSize, UP) < UP.Threshold &&			getUnrollAndJammedLoopSize(OuterLoopSize, UP) < UP.Threshold &&
	getUnrollAndJammedLoopSize(InnerLoopSize, UP) <			getUnrollAndJammedLoopSize(InnerLoopSize, UP) <
	UP.UnrollAndJamInnerLoopThreshold)			UP.UnrollAndJamInnerLoopThreshold)
	return true;			return true;
	}			}

	// Check for unroll_and_jam pragmas			// Check for unroll_and_jam pragmas
	unsigned PragmaCount = UnrollAndJamCountPragmaValue(L);			unsigned PragmaCount = UnrollAndJamCountPragmaValue(L);
	if (PragmaCount > 0) {			if (PragmaCount > 0) {
	UP.Count = PragmaCount;			UP.Count = PragmaCount;
	UP.Runtime = true;			UP.Runtime = true;
	UP.Force = true;			UP.Force = true;
	if ((UP.AllowRemainder || (OuterTripMultiple % PragmaCount == 0)) &&			if ((UP.AllowRemainder || (OuterTripMultiple % PragmaCount == 0)) &&
	getUnrollAndJammedLoopSize(OuterLoopSize, UP) < UP.Threshold &&			getUnrollAndJammedLoopSize(OuterLoopSize, UP) < UP.Threshold &&
	getUnrollAndJammedLoopSize(InnerLoopSize, UP) <			getUnrollAndJammedLoopSize(InnerLoopSize, UP) <
	UP.UnrollAndJamInnerLoopThreshold)			UP.UnrollAndJamInnerLoopThreshold)
	return true;			return true;
	}			}

	// Use computeUnrollCount from the loop unroller to get a sensible count
	// for the unrolling the outer loop. This uses UP.Threshold /
	// UP.PartialThreshold / UP.MaxCount to come up with sensible loop values.
	// We have already checked that the loop has no unroll.* pragmas.
	unsigned MaxTripCount = 0;
	bool UseUpperBound = false;
	bool ExplicitUnroll = computeUnrollCount(
	L, TTI, DT, LI, SE, EphValues, ORE, OuterTripCount, MaxTripCount,
	OuterTripMultiple, OuterLoopSize, UP, UseUpperBound);
	if (ExplicitUnroll || UseUpperBound) {
	// If the user explicitly set the loop as unrolled, dont UnJ it. Leave it
	// for the unroller instead.
	UP.Count = 0;
	return false;
	}

	bool PragmaEnableUnroll = HasUnrollAndJamEnablePragma(L);			bool PragmaEnableUnroll = HasUnrollAndJamEnablePragma(L);
	ExplicitUnroll = PragmaCount > 0 || PragmaEnableUnroll || UserUnrollCount;			bool ExplicitUnrollAndJamCount = PragmaCount > 0 || UserUnrollCount;
				bool ExplicitUnrollAndJam = PragmaEnableUnroll || ExplicitUnrollAndJamCount;

	// If the loop has an unrolling pragma, we want to be more aggressive with			// If the loop has an unrolling pragma, we want to be more aggressive with
	// unrolling limits.			// unrolling limits.
	if (ExplicitUnroll && OuterTripCount != 0)			if (ExplicitUnrollAndJam)
	UP.UnrollAndJamInnerLoopThreshold = PragmaUnrollAndJamThreshold;			UP.UnrollAndJamInnerLoopThreshold = PragmaUnrollAndJamThreshold;

	if (!UP.AllowRemainder && getUnrollAndJammedLoopSize(InnerLoopSize, UP) >=			if (!UP.AllowRemainder && getUnrollAndJammedLoopSize(InnerLoopSize, UP) >=
	UP.UnrollAndJamInnerLoopThreshold) {			UP.UnrollAndJamInnerLoopThreshold) {
				LLVM_DEBUG(dbgs() << "Won't unroll-and-jam; can't create remainder and "
				"inner loop too large\n");
	UP.Count = 0;			UP.Count = 0;
	return false;			return false;
	}			}

				// We have a sensible limit for the outer loop, now adjust it for the inner
				// loop and UP.UnrollAndJamInnerLoopThreshold. If the outer limit was set
				// explicitly, we want to stick to it.
				if (!ExplicitUnrollAndJamCount && UP.AllowRemainder) {
				while (UP.Count != 0 && getUnrollAndJammedLoopSize(InnerLoopSize, UP) >=
				UP.UnrollAndJamInnerLoopThreshold)
				UP.Count--;
				}

				// If we are explicitly unroll and jamming, we are done. Otherwise there are a
				// number of extra performance heuristics to check.
				if (ExplicitUnrollAndJam)
				return true;

	// If the inner loop count is known and small, leave the entire loop nest to			// If the inner loop count is known and small, leave the entire loop nest to
	// be the unroller			// be the unroller
	if (!ExplicitUnroll && InnerTripCount &&			if (InnerTripCount && InnerLoopSize * InnerTripCount < UP.Threshold) {
	InnerLoopSize * InnerTripCount < UP.Threshold) {			LLVM_DEBUG(dbgs() << "Won't unroll-and-jam; small inner loop count is "
				"being left for the unroller\n");
	UP.Count = 0;			UP.Count = 0;
	return false;			return false;
	}			}

	// We have a sensible limit for the outer loop, now adjust it for the inner
	// loop and UP.UnrollAndJamInnerLoopThreshold.
	while (UP.Count != 0 && UP.AllowRemainder &&
	getUnrollAndJammedLoopSize(InnerLoopSize, UP) >=
	UP.UnrollAndJamInnerLoopThreshold)
	UP.Count--;

	if (!ExplicitUnroll) {
	// Check for situations where UnJ is likely to be unprofitable. Including			// Check for situations where UnJ is likely to be unprofitable. Including
	// subloops with more than 1 block.			// subloops with more than 1 block.
	if (SubLoop->getBlocks().size() != 1) {			if (SubLoop->getBlocks().size() != 1) {
				LLVM_DEBUG(
				dbgs() << "Won't unroll-and-jam; More than one inner loop block\n");
	UP.Count = 0;			UP.Count = 0;
	return false;			return false;
	}			}

	// Limit to loops where there is something to gain from unrolling and			// Limit to loops where there is something to gain from unrolling and
	// jamming the loop. In this case, look for loads that are invariant in the			// jamming the loop. In this case, look for loads that are invariant in the
	// outer loop and can become shared.			// outer loop and can become shared.
	unsigned NumInvariant = 0;			unsigned NumInvariant = 0;
	for (BasicBlock *BB : SubLoop->getBlocks()) {			for (BasicBlock *BB : SubLoop->getBlocks()) {
	for (Instruction &I : *BB) {			for (Instruction &I : *BB) {
	if (auto *Ld = dyn_cast<LoadInst>(&I)) {			if (auto *Ld = dyn_cast<LoadInst>(&I)) {
	Value *V = Ld->getPointerOperand();			Value *V = Ld->getPointerOperand();
	const SCEV *LSCEV = SE.getSCEVAtScope(V, L);			const SCEV *LSCEV = SE.getSCEVAtScope(V, L);
	if (SE.isLoopInvariant(LSCEV, L))			if (SE.isLoopInvariant(LSCEV, L))
	NumInvariant++;			NumInvariant++;
	}			}
	}			}
	}			}
	if (NumInvariant == 0) {			if (NumInvariant == 0) {
				LLVM_DEBUG(dbgs() << "Won't unroll-and-jam; No loop invariant loads\n");
	UP.Count = 0;			UP.Count = 0;
	return false;			return false;
	}			}
	}

	return ExplicitUnroll;			return false;
	}			}

	static LoopUnrollResult			static LoopUnrollResult
	tryToUnrollAndJamLoop(Loop *L, DominatorTree &DT, LoopInfo *LI,			tryToUnrollAndJamLoop(Loop *L, DominatorTree &DT, LoopInfo *LI,
	ScalarEvolution &SE, const TargetTransformInfo &TTI,			ScalarEvolution &SE, const TargetTransformInfo &TTI,
	AssumptionCache &AC, DependenceInfo &DI,			AssumptionCache &AC, DependenceInfo &DI,
	OptimizationRemarkEmitter &ORE, int OptLevel) {			OptimizationRemarkEmitter &ORE, int OptLevel) {
	// Quick checks of the correct loop form			// Quick checks of the correct loop form

llvm/trunk/test/Transforms/LoopUnrollAndJam/pragma-explicit.ll

				; RUN: opt -loop-unroll-and-jam -allow-unroll-and-jam -unroll-runtime -unroll-partial-threshold=60 < %s -S | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; CHECK-LABEL: function
				; The explicit metadata here should force this to be unroll and jammed 4 times (hence the %.pre60.3)
				; CHECK: %.pre = phi i8 [ %.pre60.3, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %.pre.pre, %for.cond1.preheader.us.preheader.new ]
				; CHECK: %indvars.iv.3 = phi i64 [ 0, %for.cond1.preheader.us ], [ %indvars.iv.next.3, %for.body4.us ]
				define void @function(i8* noalias nocapture %dst, i32 %dst_stride, i8* noalias nocapture readonly %src, i32 %src_stride, i32 %A, i32 %B, i32 %C, i32 %D, i32 %width, i32 %height) {
				entry:
				%idxprom = sext i32 %src_stride to i64
				%cmp52 = icmp sgt i32 %height, 0
				br i1 %cmp52, label %for.cond1.preheader.lr.ph, label %for.cond.cleanup

				for.cond1.preheader.lr.ph: ; preds = %entry
				%cmp249 = icmp sgt i32 %width, 0
				%idx.ext = sext i32 %dst_stride to i64
				br i1 %cmp249, label %for.cond1.preheader.us.preheader, label %for.cond.cleanup

				for.cond1.preheader.us.preheader: ; preds = %for.cond1.preheader.lr.ph
				%.pre.pre = load i8, i8* %src, align 1
				%wide.trip.count = zext i32 %width to i64
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%.pre = phi i8 [ %.pre60, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %.pre.pre, %for.cond1.preheader.us.preheader ]
				%srcp.056.us.pn = phi i8* [ %srcp.056.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %src, %for.cond1.preheader.us.preheader ]
				%y.055.us = phi i32 [ %inc30.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%dst.addr.054.us = phi i8* [ %add.ptr.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %dst, %for.cond1.preheader.us.preheader ]
				%srcp.056.us = getelementptr inbounds i8, i8* %srcp.056.us.pn, i64 %idxprom
				%.pre60 = load i8, i8* %srcp.056.us, align 1
				br label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%0 = phi i8 [ %.pre60, %for.cond1.preheader.us ], [ %3, %for.body4.us ]
				%1 = phi i8 [ %.pre, %for.cond1.preheader.us ], [ %2, %for.body4.us ]
				%indvars.iv = phi i64 [ 0, %for.cond1.preheader.us ], [ %indvars.iv.next, %for.body4.us ]
				%conv.us = zext i8 %1 to i32
				%mul.us = mul nsw i32 %conv.us, %A
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%arrayidx8.us = getelementptr inbounds i8, i8* %srcp.056.us.pn, i64 %indvars.iv.next
				%2 = load i8, i8* %arrayidx8.us, align 1
				%conv9.us = zext i8 %2 to i32
				%mul10.us = mul nsw i32 %conv9.us, %B
				%conv14.us = zext i8 %0 to i32
				%mul15.us = mul nsw i32 %conv14.us, %C
				%arrayidx19.us = getelementptr inbounds i8, i8* %srcp.056.us, i64 %indvars.iv.next
				%3 = load i8, i8* %arrayidx19.us, align 1
				%conv20.us = zext i8 %3 to i32
				%mul21.us = mul nsw i32 %conv20.us, %D
				%add11.us = add i32 %mul.us, 32
				%add16.us = add i32 %add11.us, %mul10.us
				%add22.us = add i32 %add16.us, %mul15.us
				%add23.us = add i32 %add22.us, %mul21.us
				%4 = lshr i32 %add23.us, 6
				%conv24.us = trunc i32 %4 to i8
				%arrayidx26.us = getelementptr inbounds i8, i8* %dst.addr.054.us, i64 %indvars.iv
				store i8 %conv24.us, i8* %arrayidx26.us, align 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us
				%add.ptr.us = getelementptr inbounds i8, i8* %dst.addr.054.us, i64 %idx.ext
				%inc30.us = add nuw nsw i32 %y.055.us, 1
				%exitcond58 = icmp eq i32 %inc30.us, %height
				br i1 %exitcond58, label %for.cond.cleanup, label %for.cond1.preheader.us, !llvm.loop !5

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.lr.ph, %entry
				ret void
				}

				; CHECK-LABEL: function2
				; The explicit metadata here should force this to be unroll and jammed, but
				; the count is left to thresholds. In this case 2 (hence %.pre60.1).
				; CHECK: %.pre = phi i8 [ %.pre60.1, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %.pre.pre, %for.cond1.preheader.us.preheader.new ]
				; CHECK: %indvars.iv.1 = phi i64 [ 0, %for.cond1.preheader.us ], [ %indvars.iv.next.1, %for.body4.us ]
				define void @function2(i8* noalias nocapture %dst, i32 %dst_stride, i8* noalias nocapture readonly %src, i32 %src_stride, i32 %A, i32 %B, i32 %C, i32 %D, i32 %width, i32 %height) {
				entry:
				%idxprom = sext i32 %src_stride to i64
				%cmp52 = icmp sgt i32 %height, 0
				br i1 %cmp52, label %for.cond1.preheader.lr.ph, label %for.cond.cleanup

				for.cond1.preheader.lr.ph: ; preds = %entry
				%cmp249 = icmp sgt i32 %width, 0
				%idx.ext = sext i32 %dst_stride to i64
				br i1 %cmp249, label %for.cond1.preheader.us.preheader, label %for.cond.cleanup

				for.cond1.preheader.us.preheader: ; preds = %for.cond1.preheader.lr.ph
				%.pre.pre = load i8, i8* %src, align 1
				%wide.trip.count = zext i32 %width to i64
				br label %for.cond1.preheader.us

				for.cond1.preheader.us: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.us.preheader
				%.pre = phi i8 [ %.pre60, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %.pre.pre, %for.cond1.preheader.us.preheader ]
				%srcp.056.us.pn = phi i8* [ %srcp.056.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %src, %for.cond1.preheader.us.preheader ]
				%y.055.us = phi i32 [ %inc30.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
				%dst.addr.054.us = phi i8* [ %add.ptr.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %dst, %for.cond1.preheader.us.preheader ]
				%srcp.056.us = getelementptr inbounds i8, i8* %srcp.056.us.pn, i64 %idxprom
				%.pre60 = load i8, i8* %srcp.056.us, align 1
				br label %for.body4.us

				for.body4.us: ; preds = %for.body4.us, %for.cond1.preheader.us
				%0 = phi i8 [ %.pre60, %for.cond1.preheader.us ], [ %3, %for.body4.us ]
				%1 = phi i8 [ %.pre, %for.cond1.preheader.us ], [ %2, %for.body4.us ]
				%indvars.iv = phi i64 [ 0, %for.cond1.preheader.us ], [ %indvars.iv.next, %for.body4.us ]
				%conv.us = zext i8 %1 to i32
				%mul.us = mul nsw i32 %conv.us, %A
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%arrayidx8.us = getelementptr inbounds i8, i8* %srcp.056.us.pn, i64 %indvars.iv.next
				%2 = load i8, i8* %arrayidx8.us, align 1
				%conv9.us = zext i8 %2 to i32
				%mul10.us = mul nsw i32 %conv9.us, %B
				%conv14.us = zext i8 %0 to i32
				%mul15.us = mul nsw i32 %conv14.us, %C
				%arrayidx19.us = getelementptr inbounds i8, i8* %srcp.056.us, i64 %indvars.iv.next
				%3 = load i8, i8* %arrayidx19.us, align 1
				%conv20.us = zext i8 %3 to i32
				%mul21.us = mul nsw i32 %conv20.us, %D
				%add11.us = add i32 %mul.us, 32
				%add16.us = add i32 %add11.us, %mul10.us
				%add22.us = add i32 %add16.us, %mul15.us
				%add23.us = add i32 %add22.us, %mul21.us
				%4 = lshr i32 %add23.us, 6
				%conv24.us = trunc i32 %4 to i8
				%arrayidx26.us = getelementptr inbounds i8, i8* %dst.addr.054.us, i64 %indvars.iv
				store i8 %conv24.us, i8* %arrayidx26.us, align 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us

				for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us
				%add.ptr.us = getelementptr inbounds i8, i8* %dst.addr.054.us, i64 %idx.ext
				%inc30.us = add nuw nsw i32 %y.055.us, 1
				%exitcond58 = icmp eq i32 %inc30.us, %height
				br i1 %exitcond58, label %for.cond.cleanup, label %for.cond1.preheader.us, !llvm.loop !7

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.lr.ph, %entry
				ret void
				}

				!5 = distinct !{!5, !6}
				!6 = !{!"llvm.loop.unroll_and_jam.count", i32 4}
				!7 = distinct !{!7, !8}
				!8 = !{!"llvm.loop.unroll_and_jam.enable"}