This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Enable partial and runtime unrolling
Closed, Public

Authored by samparker on Jun 26 2017, 7:33 AM.

Details

Summary

Enable runtime and partial loop unrolling of simple loops without calls, on Thumb2 M-Class cores. The thresholds are calculated based on the Subtarget's issue width.

Diff Detail

Repository
rL LLVM

Event Timeline

samparker created this revision. Jun 26 2017, 7:33 AM

Hi Sam,
Where are the tests? ;-) The ad hoc way of calculating the trip count (or the variable) in getTripCountValue() doesn't feel entirely right; isn't this what SCEV should be providing?
Cheers,
Sjoerd.

Forgot to say that it would be good if you could share some numbers showing why this is a good thing to do.

Bah, of course! I will get some tests together. As for the trip count... yes, I think SCEV handles these things, and I was hoping someone could point me to a nicer, more standardised way of getting the value.

cheers,
sam
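For reference, a minimal sketch of the SCEV route being hinted at, assuming the standard ScalarEvolution API (the wrapper helper itself is hypothetical, not code from this patch):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
using namespace llvm;

// Ask SCEV for the backedge-taken count instead of computing the trip
// count by hand; returns null when SCEV cannot analyse the exit condition.
static const SCEV *getTripCountSCEV(Loop *L, ScalarEvolution &SE) {
  const SCEV *BTC = SE.getBackedgeTakenCount(L);
  if (isa<SCEVCouldNotCompute>(BTC))
    return nullptr;
  return BTC; // the trip count is the backedge-taken count plus one
}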

samparker updated this revision to Diff 104139. Jun 27 2017, 4:52 AM

Added some tests and changed the initial thresholds to be based purely on the issue width of the CPU, for which I've added a helper function in ARMSubtarget.
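To illustrate the shape of that change, a hedged sketch; the helper name and the multiplier are placeholders, not the committed code:

// Hypothetical ARMSubtarget helper exposing the issue width from the
// MC scheduling model.
unsigned ARMSubtarget::getIssueWidth() const {
  return getSchedModel().IssueWidth;
}

// In ARMTTIImpl::getUnrollingPreferences, the thresholds could then be
// scaled from it.
UP.Partial = true;
UP.Runtime = true;
UP.Threshold = ST->getIssueWidth() * 4; // placeholder multiplier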

samparker edited the summary of this revision. Jun 27 2017, 5:27 AM

Hi Sjoerd,

I've been running an industry-standard benchmark suite targeting Cortex-M4 and Cortex-M7. This patch gives modest improvements on many of the benchmarks, as well as some significantly better results. Overall improvements are ~1% for Cortex-M4 and ~3% for Cortex-M7.

cheers,
sam

efriedma added inline comments.
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

These heuristics seem really weird... why does the loop depth affect the performance of a loop? Why does the trip count of the parent loop affect the performance of a loop?

I mean, I can imagine these heuristics are profitable for your particular benchmarks, but it seems like a performance trap: for example, someone turns on LTO, so a function gets inlined into a loop, so we decide it's no longer profitable to unroll the inner loop, and we lose a bunch of performance.

samparker added inline comments. Jun 28 2017, 1:40 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

Hi Eli,

The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops; it's not the actual trip count that affects performance.

I have found that it's not profitable to unroll loops with calls because it can prevent inlining from happening, so unrolling is disabled in those cases in the hope that inlining occurs instead. Some inlining could push the cost above the unrolling threshold, but that doesn't matter because we still get the performance gains from inlining. The thresholds have been chosen to unroll small loops, where the overhead of the backedge on M-class cores is significant.

cheers,
sam
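Concretely, the invariance check Sam describes might look like this sketch, which uses the standard SCEV API (the helper itself is hypothetical):

// Only consider unrolling when SCEV proves the inner loop's trip count
// does not change across iterations of any enclosing loop.
static bool tripCountInvariantInParents(Loop *L, ScalarEvolution &SE) {
  const SCEV *BTC = SE.getBackedgeTakenCount(L);
  if (isa<SCEVCouldNotCompute>(BTC))
    return false;
  for (Loop *Parent = L->getParentLoop(); Parent;
       Parent = Parent->getParentLoop())
    if (!SE.isLoopInvariant(BTC, Parent))
      return false;
  return true;
}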

fhahn added a subscriber: fhahn. Jun 28 2017, 1:44 AM
samparker updated this revision to Diff 104378. Jun 28 2017, 3:37 AM

Amended some comments.

efriedma added inline comments. Jun 28 2017, 11:32 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops,

This is still problematic. You're deciding whether to unroll the loop based on code which is not part of the loop, which means the end result is going to be sensitive to unrelated inlining decisions.


Scanning the loop for calls seems fine.

test/CodeGen/ARM/loop-unrolling.ll
152 ↗(On Diff #104378)

Why do we want to avoid unrolling here? At first glance, this looks like it should be profitable to unroll (more scheduler freedom, avoid branches).

samparker updated this revision to Diff 104667. Jun 29 2017, 9:36 AM
samparker edited the summary of this revision.

Updated to use the freshly introduced ScalarEvolution parameter and simplified the heuristics by no longer checking the loop depth.

samparker added inline comments. Jun 29 2017, 10:51 AM
test/CodeGen/ARM/loop-unrolling.ll
152 ↗(On Diff #104378)

Enabling runtime unrolling can generate more spill code, and for small runtime trip counts the unrolled version may not even execute very often. As a result, I found that unrolling inner loops whose variable trip count is computed in the parent loop can cause some very significant regressions.

I should also say that even though the average performance improvement is modest, some tested benchmarks see improvements of up to 40%, 50% and 80% on Cortex-M4, Cortex-M7 and Cortex-M33, respectively.

I don't see how a trip count which varies for each iteration of the parent loop implies the trip count is small. I mean, it's possible it does for your particular benchmarks, but it seems unlikely to generalize to other code.

lib/Target/ARM/ARMTargetTransformInfo.cpp
585 ↗(On Diff #104667)

There isn't any reason to check isLoopInvariant for each loop: if the count is invariant relative to the topmost loop, it will be invariant relative to every loop it contains.
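In code, that observation collapses the per-parent walk sketched earlier to a single query (sketch only; BTC is the backedge-taken count from SCEV):

// Every contained loop executes within the outermost loop, so one
// isLoopInvariant query against it subsumes the per-loop checks.
Loop *Outermost = L;
while (Loop *Parent = Outermost->getParentLoop())
  Outermost = Parent;
bool TripCountInvariant = SE.isLoopInvariant(BTC, Outermost);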

Hi Eli,

Your comments make sense to me, so I ran an example to figure out if this heuristic was indeed nonsense. Here's the example kernel:

for (unsigned i = 0; i < max; ++i) {
  acc = 0;
  innerMax = dataSize - i; // inner trip count varies with the outer iteration
  for (unsigned j = 0; j < innerMax; ++j) {
    acc += (input[j] * input[i+j]) >> scaleValue;
  }
}

The results in the graph show that the unrolled version is often faster, but the net effect across the data set is that unrolling is detrimental to performance. My other benchmark results also show that this restriction doesn't negatively impact performance, so I think including the heuristic to prevent unrolling is valid.

cheers,
sam

Sam, interesting performance numbers! It looks like we see some strange bimodal behaviour: (big) improvements are cancelled out by (big) regressions. Assuming that codegen for this function is the same here (just the value of dataSize is different), I wonder whether we're just looking at some micro-architectural weirdness. So I don't think we can draw any conclusions from these numbers. Do we need more data points?

Yes, good point. Those numbers were from the M7, which has a branch predictor, so I will run it again on something simpler.

There are two kinds of SCEV expressions which are not invariant: SCEVAddRecExpr, and SCEVUnknown. If you have a SCEVAddRecExpr, you can show the loop nest is similar to your testcase, and I guess that might change the profitability of unrolling. If you have a SCEVUnknown, I can't see how you would conclude anything useful; there's a strong possibility the trip count is actually constant, but the compiler can't prove it because of aliasing or something like that.
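As a sketch, that distinction can be made explicit before drawing any conclusion; this fragment only inspects the top-level expression (the expression classes live in ScalarEvolutionExpressions.h):

const SCEV *BTC = SE.getBackedgeTakenCount(L);
if (isa<SCEVAddRecExpr>(BTC)) {
  // The count evolves with an enclosing loop, as in the kernel above;
  // this is the case where unrolling profitability may genuinely change.
} else if (isa<SCEVUnknown>(BTC)) {
  // Opaque to SCEV: the value may well be constant at runtime, so nothing
  // useful follows about how it varies.
}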

samparker updated this revision to Diff 105842. Jul 10 2017, 5:38 AM

After more testing, I've found that the variant inner-loop check was unnecessary, so I've removed it.

efriedma added inline comments. Jul 10 2017, 3:56 PM
lib/Target/ARM/ARMTargetTransformInfo.cpp
552 ↗(On Diff #105842)

If you want this to be a no-op for other architectures, shouldn't this be "return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP)"?
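The suggested fallback, in context; a sketch in which BaseT is the usual CRTP alias for BasicTTIImplBase<ARMTTIImpl> and the guard condition is illustrative:

void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                         TTI::UnrollingPreferences &UP) {
  // Defer entirely to the generic implementation on cores this patch
  // doesn't target, so other architectures see no behavioural change.
  if (!ST->isThumb2() || !ST->isMClass())
    return BaseT::getUnrollingPreferences(L, SE, UP);
  // ... M-class-specific thresholds follow ...
}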

559 ↗(On Diff #105842)

Do we really call into getUnrollingPreferences for the early unroll pass? We probably don't want to be using target-specific unroll heuristics for "createSimpleLoopUnrollPass". But I guess that's not something you need to fix in this patch.

lib/Target/ARM/ARMTargetTransformInfo.h
126 ↗(On Diff #105842)

"override"?

samparker added inline comments. Jul 11 2017, 9:30 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
552 ↗(On Diff #105842)

ah yes, thanks.

559 ↗(On Diff #105842)

As far as I can see, the unroll pass functionality can only be controlled externally via command-line options or by target hooks.
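For example, the unroller exposes cl::opts in LoopUnrollPass.cpp that override its decisions from the command line; the file names here are illustrative:

opt -loop-unroll -unroll-runtime -unroll-threshold=300 -S input.ll -o unrolled.ll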

lib/Target/ARM/ARMTargetTransformInfo.h
126 ↗(On Diff #105842)

Oddly, no, it's not virtual.

samparker updated this revision to Diff 106052. Jul 11 2017, 9:34 AM

Added the default call to the base implementation.

efriedma added inline comments. Jul 11 2017, 10:50 AM
test/CodeGen/ARM/loop-unrolling.ll
1 ↗(On Diff #106052)

We usually put tests for IR transforms into test/Transforms/; for unrolling in particular, that would be something like test/Transforms/LoopUnroll/ARM/ (you'll have to create a new directory, and make a lit.local.cfg so the test is only enabled if we have an ARM target).
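For reference, such a lit.local.cfg conventionally contains the standard target-guard pattern used by other ARM test directories:

if not 'ARM' in config.root.targets:
    config.unsupported = True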

And I'd like to see some basic test to show that this doesn't change the unroll heuristics on, for example, cortex-a57 (where BasicTTIImplBase::getUnrollingPreferences currently changes the unroll threshold).

samparker updated this revision to Diff 106443. Jul 13 2017, 9:46 AM

I've moved the test to the suggested directory and added some more targets to it. I've also now enabled unrolling for Thumb targets and removed the use of the issue width in the calculation.

samparker updated this revision to Diff 106445. Jul 13 2017, 9:50 AM

Now added the test config file!

efriedma added inline comments. Jul 17 2017, 5:39 PM
test/Transforms/LoopUnroll/ARM/loop-unrolling.ll
5 ↗(On Diff #106445)

Could you add a run to check -mcpu=cortex-a57, to verify that we're still inheriting the unroll heuristics from BasicTTIImplBase?
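Something along these lines would do it; the triple and check prefix are illustrative:

; RUN: opt -mtriple=armv8-linux-gnueabihf -mcpu=cortex-a57 -loop-unroll -S %s | FileCheck %s --check-prefix=CHECK-A57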

samparker updated this revision to Diff 107305. Jul 19 2017, 8:10 AM

Hi Eli,

I've updated the tests to target the A57 for both A32 and T32 mode.

I've actually now benchmarked the A53 and A57 with this patch and the results are favourable, but there is a bug I need to fix before I enable it. That will be another patch.

cheers,
sam

This revision is now accepted and ready to land. Jul 19 2017, 12:26 PM
This revision was automatically updated to reflect the committed changes.