This is an archive of the discontinued LLVM Phabricator instance.

[MI-sched] schedule following instruction latencies
Abandoned · Public

Authored by sebpop on Apr 3 2018, 2:18 PM.

Details

Summary

With this patch LLVM starts scheduling instructions based on their latency.

For top-down scheduling, prefer scheduling instructions at a lower depth:
those instructions will be ready to execute before others at a higher depth.
For bottom-up scheduling, prefer scheduling instructions at a lower height.
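In sketch form, the preference described above could look like this (the diff itself is not reproduced in this archive; tryLess and the CandReason values are the existing helpers from MachineScheduler.cpp, and the exact placement inside GenericScheduler::tryCandidate is an assumption):

// Sketch: prefer the candidate whose operands are ready earlier.
if (Zone->isTop()) {
  // Top-down: lower depth means the instruction can issue sooner.
  if (tryLess(TryCand.SU->getDepth(), Cand.SU->getDepth(),
              TryCand, Cand, GenericSchedulerBase::TopDepthReduce))
    return;
} else {
  // Bottom-up: lower height is the mirror-image criterion.
  if (tryLess(TryCand.SU->getHeight(), Cand.SU->getHeight(),
              TryCand, Cand, GenericSchedulerBase::BotHeightReduce))
    return;
}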

For the following testcase:

void fun(char *restrict in, char *restrict out) {
  *in++ = *out++;
  *in++ = *out++;
  *in++ = *out++;
  *in++ = *out++;
}

on aarch64 we used to produce this code:

	ldrb	w8, [x1]
	strb	w8, [x0]
	ldrb	w8, [x1, #1]
	strb	w8, [x0, #1]
	ldrb	w8, [x1, #2]
	strb	w8, [x0, #2]
	ldrb	w8, [x1, #3]
	strb	w8, [x0, #3]

with the patch we now produce:

	ldrb	w8, [x1]
	ldrb	w9, [x1, #1]
	ldrb	w10, [x1, #2]
	ldrb	w11, [x1, #3]
	strb	w8, [x0]
	strb	w9, [x0, #1]
	strb	w10, [x0, #2]
	strb	w11, [x0, #3]

There are about 600 tests modified by this patch (mostly on the x86 side, and a few on aarch64).

Diff Detail

Event Timeline

sebpop created this revision. Apr 3 2018, 2:18 PM

I think we should ask the backend whether we want to gang up loads and stores, and if we do, how many we should gang up together. Ganging up lots of loads may result in high register pressure.

MatzeB added a comment. Apr 3 2018, 9:27 PM
This comment was removed by MatzeB.
MatzeB added a comment. Apr 3 2018, 9:31 PM

or

$ clang -arch arm64 -O3 -S -o- t.c -mcpu=generic
...
	ldrb	w8, [x1]
	ldrb	w9, [x1, #1]
	ldrb	w10, [x1, #2]
	ldrb	w11, [x1, #3]
	strb	w8, [x0]
	strb	w9, [x0, #1]
	strb	w10, [x0, #2]
	strb	w11, [x0, #3]
	ret
  • Are you sure you are using a good scheduling model?
  • A change like this needs a lot of benchmarking. In my experience, real-world performance loses ground all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency().)
This comment was removed by javed.absar.

Hi Sebastian:
For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e., scheduling the store just one cycle after the load, which would definitely result in stalls, especially on in-order processors). I am guessing the latency it is picking up is 1?

  • Are you sure you are using a good scheduling model?

I have seen badly scheduled code for -mcpu=cortex-a57 and the exynos-m* tunings.

Curiously, cortex-a53 has the best code, after inst-combine manages to fuse all the byte loads and stores.
Evandro will be investigating why this does not happen for the other -mcpu tunings.

In my experience, real-world performance loses ground all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency().)

Thanks for the pointer to tryLatency(): the good news is that the function
contains exactly the code I wrote in this patch.
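For reference, tryLatency in the MachineScheduler.cpp of this era looks roughly like this (reconstructed, not quoted verbatim from the tree):

static bool tryLatency(GenericSchedulerBase::SchedCandidate &TryCand,
                       GenericSchedulerBase::SchedCandidate &Cand,
                       SchedBoundary &Zone) {
  if (Zone.isTop()) {
    // Top-down: prefer lower depth once latency is already exposed,
    // otherwise prefer the longer remaining path (height).
    if (Cand.SU->getDepth() > Zone.getScheduledLatency()) {
      if (tryLess(TryCand.SU->getDepth(), Cand.SU->getDepth(),
                  TryCand, Cand, GenericSchedulerBase::TopDepthReduce))
        return true;
    }
    if (tryGreater(TryCand.SU->getHeight(), Cand.SU->getHeight(),
                   TryCand, Cand, GenericSchedulerBase::TopPathReduce))
      return true;
  } else {
    // Bottom-up: the symmetric checks on height and depth.
    if (Cand.SU->getHeight() > Zone.getScheduledLatency()) {
      if (tryLess(TryCand.SU->getHeight(), Cand.SU->getHeight(),
                  TryCand, Cand, GenericSchedulerBase::BotHeightReduce))
        return true;
    }
    if (tryGreater(TryCand.SU->getDepth(), Cand.SU->getDepth(),
                   TryCand, Cand, GenericSchedulerBase::BotPathReduce))
      return true;
  }
  return false;
}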
tryLatency is not executed because TryCand.Policy.ReduceLatency is false;
it is only set in the following code, so the problem may lie in this logic:

// Schedule aggressively for latency in PostRA mode. We don't check for
// acyclic latency during PostRA, and highly out-of-order processors will
// skip PostRA scheduling.
if (IsPostRA || (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)) {
  Policy.ReduceLatency |= true;
}

At the beginning of scheduling we are not on the critical path, so we do not set ReduceLatency.
We then schedule an instruction without looking at its latency, which introduces a bubble
that exceeds the critical path and consequently forces ReduceLatency to be set
for the remaining instructions to be scheduled.
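To make the failure mode concrete with made-up numbers (the real values depend on the machine model):

// Illustration only; assume Rem.CriticalPath = 5 cycles and a
// 4-cycle load-to-use latency.
//   First pick:  RemLatency = 5, CurrZone.getCurrCycle() = 0
//                -> 5 + 0 > 5 is false, so ReduceLatency stays unset.
//   The picker falls back to source order and places a store right
//   after its load, creating a multi-cycle bubble.
//   Later picks: RemLatency = 5, CurrZone.getCurrCycle() = 4
//                -> 5 + 4 > 5 is true, so ReduceLatency is finally set,
//                   but only after the bubble already exists.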

Hi Sebastian:
For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e., scheduling the store just one cycle after the load, which would definitely result in stalls, especially on in-order processors).

See the comment that I added inline to the place that falls through to the original scheduling order.

llvm/lib/CodeGen/MachineScheduler.cpp:3018

If nothing above returns early, this will pick the instruction with the smallest SU number in top-down scheduling and the largest SU number in bottom-up scheduling.
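The fall-through in question, at the end of GenericScheduler::tryCandidate, is roughly this (reconstructed from the code of this era):

// Fall through to original instruction order.
if ((Zone->isTop() && TryCand.SU->NodeNum < Cand.SU->NodeNum)
    || (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
  TryCand.Reason = NodeOrder;
}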

In the testcase compiled with -mcpu=cortex-a57, the load at SU(8) will be picked by the bottom-up scheduler before the other roots at SU(3), SU(5), and SU(7) for this "ORDER" reason.

*** Final schedule for %bb.0 ***
SU(0):   %1:gpr64common = COPY $x1
SU(1):   %0:gpr64common = COPY $x0
SU(2):   %2:gpr32 = LDRBBui %1:gpr64common, 0 :: (load 1 from %ir.out, !tbaa !2)
SU(4):   %3:gpr32 = LDRBBui %1:gpr64common, 1 :: (load 1 from %ir.incdec.ptr, !tbaa !2)
SU(6):   %4:gpr32 = LDRBBui %1:gpr64common, 2 :: (load 1 from %ir.incdec.ptr2, !tbaa !2)
SU(3):   STRBBui %2:gpr32, %0:gpr64common, 0 :: (store 1 into %ir.in, !tbaa !2)
SU(5):   STRBBui %3:gpr32, %0:gpr64common, 1 :: (store 1 into %ir.incdec.ptr1, !tbaa !2)
SU(7):   STRBBui %4:gpr32, %0:gpr64common, 2 :: (store 1 into %ir.incdec.ptr3, !tbaa !2)
SU(8):   %5:gpr32 = LDRBBui %1:gpr64common, 3 :: (load 1 from %ir.incdec.ptr4, !tbaa !2)
SU(9):   STRBBui %5:gpr32, %0:gpr64common, 3 :: (store 1 into %ir.incdec.ptr5, !tbaa !2)

SU(8) gets into the ready queue only after the bottom-up scheduler has scheduled SU(9), which consumes the output of SU(8).
Scheduling SU(8) just before SU(9) is very bad, as there are no other instructions left to hide the latency of SU(8).
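Assuming, for illustration, a 4-cycle load-to-use latency, the tail of the schedule above behaves like this:

// cycle 0:    ldrb w8, [x1, #3]   ; SU(8)
// cycles 1-3: stall               ; nothing left to hide the load latency
// cycle 4:    strb w8, [x0, #3]   ; SU(9) must wait for w8
// Had SU(8) been scheduled with the other loads, the three remaining
// stores could have filled those stall cycles.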

fhahn added a subscriber: fhahn. Apr 4 2018, 11:52 AM

A while ago I put up a patch to make the scheduler slightly more aggressive about using latency, D38279. I did not really have time to follow up on it though.

sebpop abandoned this revision. Apr 4 2018, 1:27 PM

Thanks @fhahn for the pointer: I am closing this revision and will try to fix the problem with something similar to D38279.
I tried that code out, and it does not modify the current behaviour of the scheduler.

dmgreen added a subscriber: dmgreen. Apr 4 2018, 1:37 PM
fhahn added a subscriber: jonpa. Apr 12 2018, 10:02 AM

Great, I am looking forward to hearing what you find! @jonpa just landed r329884, which should make it easier to add target-specific MachineSchedulers with customized heuristics; basically all one has to do is implement a custom tryCandidate function. IMO it would be easier to experiment with heuristics if we had a separate scheduler with experimental heuristics for AArch64.
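A hypothetical example of the kind of strategy r329884 enables (the class name and body are illustrative, not code from the tree, and the exact tryCandidate signature may differ between LLVM versions):

// Illustrative AArch64 strategy overriding the candidate comparison
// hook made virtual by r329884.
class AArch64ExperimentalSchedStrategy : public GenericScheduler {
public:
  AArch64ExperimentalSchedStrategy(const MachineSchedContext *C)
      : GenericScheduler(C) {}

  void tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
                    SchedBoundary *Zone) const override {
    // Try experimental AArch64 latency heuristics first...
    // ...then defer to the generic comparisons.
    GenericScheduler::tryCandidate(Cand, TryCand, Zone);
  }
};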