This is an archive of the discontinued LLVM Phabricator instance.

[MVE] Don't try to unroll vectorised MVE loops
ClosedPublic

Authored by dmgreen on Aug 6 2019, 6:07 AM.

Details

Reviewers

t.p.northover
samparker
SjoerdMeijer
simon_tatham
ostannard

Commits

rG11c4602fce16: [MVE] Don't try to unroll vectorised MVE loops
rL368530: [MVE] Don't try to unroll vectorised MVE loops

Summary

Due to the nature of the beat system in an MVE pipeline, with tail predication and low-overhead loops, unrolling brings less benefit than it does for normal loops. You cannot, for example, hide the latency of a load with other instructions as you can for scalar code. Not unrolling also makes the code easier to read and reason about.

So if a loop has already been vectorised, or otherwise contains vector code, don't enable the runtime unrolling. At least for the time being.

Diff Detail

Repository: rL LLVM

Event Timeline

dmgreen created this revision. Aug 6 2019, 6:07 AM

Herald added a project: Restricted Project. Aug 6 2019, 6:07 AM

Herald added subscribers: zzheng, hiraditya, javed.absar.

Performance aside, I can't see how we can legally unroll a vectorized loop in the case of tail predication. I also don't know how we'd detect that a loop had been unrolled when we get to the conversion phase, so it seems like this is the only place where we can maintain correctness. And because we need this for legality, we can't rely on metadata...

Hey. I'm not sure I see what you mean. Surely if unrolling causes changes that are illegal from a tail predication point of view, we would need to not do tail predication! (i.e. you can manually unroll a loop and produce similar code).

These changes are aimed at performance though. For any loop that we are not tail-predicating, we should still prefer to not unroll it.

Yes, I guess that would be a sensible approach! I am worried that one of the (many) passes will trip somewhere, so this gives me another example test case... For performance, I'm still not convinced this is the best approach because (1) we can't depend on metadata and (2) doesn't this also prevent the scalar remainder from being unrolled too?

(1) we can't depend on metadata

We can depend on metadata for performance though, that's what it's there for! Just not for correctness (at least, that's what I understood/remember from the language ref). The vectoriser adds llvm.loop.unroll.runtime.disable, being fairly confident that it disables the unrolling for the remainder loop, for example.

(2) doesn't this also prevent the scalar remainder from being unrolled too?

It will already have no-unroll metadata on it. It's known to be a short loop, at most 3 iterations in this case (as we vectorise x 4).
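The "at most 3 iterations" bound follows from how the test's vector.ph block rounds the trip count down with `%n.vec = and i32 %n, -4`. A minimal sketch of that arithmetic, assuming a power-of-two vectorisation factor; `TripSplit` and `splitTripCount` are hypothetical names for illustration and are not LLVM APIs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an LLVM API: splits a trip count the way the
// test's vector.ph block does with `%n.vec = and i32 %n, -4`.
struct TripSplit {
  uint32_t VectorIterations; // iterations executed by the vector body
  uint32_t ScalarRemainder;  // iterations left for the scalar loop
};

TripSplit splitTripCount(uint32_t N, uint32_t VF) {
  // The mask trick below is only valid for a nonzero power-of-two VF.
  assert(VF != 0 && (VF & (VF - 1)) == 0 && "VF must be a power of two");
  uint32_t NVec = N & ~(VF - 1); // same value as `N & -VF`
  return {NVec / VF, N - NVec};  // the remainder is always below VF
}
```

With VF = 4 the remainder `N - NVec` equals `N % 4`, so the scalar loop runs 0 to 3 times, which is why runtime-unrolling it would buy little.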

But, am I correct in saying that you are not against preventing unrolling, just this way of looking at the metadata? If so, would looking for vector instructions sound better to you? That sounds simple enough and should produce the same effect in most cases. Let me know!

Yes, I'm definitely up for preventing unrolling and I think checking the instructions would be better - we'll catch vector intrinsics that way too.

This cleaned up nicely. It just checks return type, as I believe that will catch all interesting loops.
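To make the "checks return type" idea concrete, here is a toy model of the scan, not the patch itself. `ToyInst`, `ResultKind` and the helper functions are hypothetical stand-ins for `llvm::Instruction` and `I.getType()->isVectorTy()`; the instruction strings are modelled on the test's vector.body and for.body blocks:

```cpp
#include <string>
#include <vector>

// Toy stand-ins for llvm::Instruction and its result type. These are
// hypothetical types for illustration, not LLVM classes.
enum class ResultKind { Void, Scalar, Vector };

struct ToyInst {
  std::string Text;
  ResultKind Result;
};

// Returns true if any instruction produces a vector value, mirroring the
// shape of the `I.getType()->isVectorTy()` check added by the patch.
bool producesVectorValue(const std::vector<ToyInst> &Body) {
  for (const ToyInst &I : Body)
    if (I.Result == ResultKind::Vector)
      return true;
  return false;
}

// Modelled on the test's vector.body block.
std::vector<ToyInst> vectorBody() {
  return {{"getelementptr inbounds float", ResultKind::Scalar},
          {"bitcast float* to <4 x float>*", ResultKind::Scalar},
          {"load <4 x float>", ResultKind::Vector},
          {"fadd fast <4 x float>", ResultKind::Vector},
          {"store <4 x float>", ResultKind::Void},
          {"icmp eq i32", ResultKind::Scalar}};
}

// Modelled on the test's scalar for.body block.
std::vector<ToyInst> scalarBody() {
  return {{"load float", ResultKind::Scalar},
          {"fadd fast float", ResultKind::Scalar},
          {"store float", ResultKind::Void},
          {"icmp eq i32", ResultKind::Scalar}};
}

// A vector store on its own: the store instruction has void type.
std::vector<ToyInst> storeOnlyBody() {
  return {{"store <4 x float>", ResultKind::Void}};
}
```

One thing the model makes visible: a store of a vector value has void result type, so the store itself does not trigger the check. In the test loops the wide loads and the `fadd` produce vector values, so they are caught anyway.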

Ah, nice. LGTM

This revision is now accepted and ready to land. Aug 8 2019, 12:55 AM

Closed by commit rL368530: [MVE] Don't try to unroll vectorised MVE loops (authored by dmgreen). Aug 11 2019, 1:53 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.cpp
Size: 5 lines

Path: llvm/trunk/test/Transforms/LoopUnroll/ARM/mve-nounroll.ll
Size: 127 lines

Diff 214556

llvm/trunk/lib/Target/ARM/ARMTargetTransformInfo.cpp

     for (auto &I : *BB) {
       if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
         ImmutableCallSite CS(&I);
         if (const Function *F = CS.getCalledFunction()) {
           if (!isLoweredToCall(F))
             continue;
         }
         return;
       }
+      // Don't unroll vectorised loop. MVE does not benefit from it as much as
+      // scalar code.
+      if (I.getType()->isVectorTy())
+        return;
+
       SmallVector<const Value*, 4> Operands(I.value_op_begin(),
                                             I.value_op_end());
       Cost += getUserCost(&I, Operands);
     }
   }

   LLVM_DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");

llvm/trunk/test/Transforms/LoopUnroll/ARM/mve-nounroll.ll

				; RUN: opt -mtriple=thumbv8.1m.main -mattr=+mve.fp -loop-unroll -S < %s -o - | FileCheck %s

				; CHECK-LABEL: @loopfn
				; CHECK: vector.body:
				; CHECK: br i1 %7, label %middle.block, label %vector.body, !llvm.loop !0
				; CHECK: middle.block:
				; CHECK: br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader13
				; CHECK: for.body:
				; CHECK: br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !2

				define void @loopfn(float* %s1, float* %s2, float* %d, i32 %n) {
				entry:
				%cmp10 = icmp sgt i32 %n, 0
				br i1 %cmp10, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%min.iters.check = icmp ult i32 %n, 4
				br i1 %min.iters.check, label %for.body.preheader13, label %vector.ph

				for.body.preheader13: ; preds = %middle.block, %for.body.preheader
				%i.011.ph = phi i32 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
				br label %for.body

				vector.ph: ; preds = %for.body.preheader
				%n.vec = and i32 %n, -4
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%0 = getelementptr inbounds float, float* %s1, i32 %index
				%1 = bitcast float* %0 to <4 x float>*
				%wide.load = load <4 x float>, <4 x float>* %1, align 4
				%2 = getelementptr inbounds float, float* %s2, i32 %index
				%3 = bitcast float* %2 to <4 x float>*
				%wide.load12 = load <4 x float>, <4 x float>* %3, align 4
				%4 = fadd fast <4 x float> %wide.load12, %wide.load
				%5 = getelementptr inbounds float, float* %d, i32 %index
				%6 = bitcast float* %5 to <4 x float>*
				store <4 x float> %4, <4 x float>* %6, align 4
				%index.next = add i32 %index, 4
				%7 = icmp eq i32 %index.next, %n.vec
				br i1 %7, label %middle.block, label %vector.body, !llvm.loop !0

				middle.block: ; preds = %vector.body
				%cmp.n = icmp eq i32 %n.vec, %n
				br i1 %cmp.n, label %for.cond.cleanup, label %for.body.preheader13

				for.cond.cleanup.loopexit: ; preds = %for.body
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %middle.block, %entry
				ret void

				for.body: ; preds = %for.body.preheader13, %for.body
				%i.011 = phi i32 [ %add3, %for.body ], [ %i.011.ph, %for.body.preheader13 ]
				%arrayidx = getelementptr inbounds float, float* %s1, i32 %i.011
				%8 = load float, float* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds float, float* %s2, i32 %i.011
				%9 = load float, float* %arrayidx1, align 4
				%add = fadd fast float %9, %8
				%arrayidx2 = getelementptr inbounds float, float* %d, i32 %i.011
				store float %add, float* %arrayidx2, align 4
				%add3 = add nuw nsw i32 %i.011, 1
				%exitcond = icmp eq i32 %add3, %n
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !2
				}



				; CHECK-LABEL: @nested
				; CHECK: for.outer:
				; CHECK: br label %vector.body
				; CHECK: vector.body:
				; CHECK: br i1 %8, label %for.latch, label %vector.body, !llvm.loop !0
				; CHECK: for.latch:
				; CHECK: br i1 %exitcond34, label %for.cond.cleanup.loopexit, label %for.outer

				define void @nested(float* %s1, float* %s2, float* %d, i32 %n) {
				entry:
				%cmp31 = icmp eq i32 %n, 0
				br i1 %cmp31, label %for.cond.cleanup, label %for.outer.preheader

				for.outer.preheader: ; preds = %entry
				%min.iters.check = icmp ult i32 %n, 4
				%n.vec = and i32 %n, -4
				%cmp.n = icmp eq i32 %n.vec, %n
				br label %for.outer

				for.outer: ; preds = %for.outer.preheader, %for.cond1.for.cond.cleanup3_crit_edge.us
				%j.032.us = phi i32 [ %inc.us, %for.latch ], [ 0, %for.outer.preheader ]
				%mul.us = mul i32 %j.032.us, %n
				br label %vector.body

				vector.body: ; preds = %for.outer, %vector.body
				%index = phi i32 [ %index.next, %vector.body ], [ 0, %for.outer ]
				%0 = add i32 %index, %mul.us
				%1 = getelementptr inbounds float, float* %s1, i32 %0
				%2 = bitcast float* %1 to <4 x float>*
				%wide.load = load <4 x float>, <4 x float>* %2, align 4
				%3 = getelementptr inbounds float, float* %s2, i32 %0
				%4 = bitcast float* %3 to <4 x float>*
				%wide.load35 = load <4 x float>, <4 x float>* %4, align 4
				%5 = fadd fast <4 x float> %wide.load35, %wide.load
				%6 = getelementptr inbounds float, float* %d, i32 %0
				%7 = bitcast float* %6 to <4 x float>*
				store <4 x float> %5, <4 x float>* %7, align 4
				%index.next = add i32 %index, 4
				%8 = icmp eq i32 %index.next, %n.vec
				br i1 %8, label %for.latch, label %vector.body, !llvm.loop !0

				for.latch: ; preds = %vector.body, %for.outer
				%i.030.us.ph = phi i32 [ %n.vec, %vector.body ]
				%inc.us = add nuw i32 %j.032.us, 1
				%exitcond34 = icmp eq i32 %inc.us, %n
				br i1 %exitcond34, label %for.cond.cleanup.loopexit, label %for.outer

				for.cond.cleanup.loopexit:
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
				ret void
				}

				!0 = distinct !{!0, !1}
				!1 = !{!"llvm.loop.isvectorized", i32 1}
				!2 = distinct !{!2, !3, !1}
				!3 = !{!"llvm.loop.unroll.runtime.disable"}