This is an archive of the discontinued LLVM Phabricator instance.

[MI-sched] schedule following instruction latencies
Abandoned · Public

Authored by sebpop on Apr 3 2018, 2:18 PM.

Details

Summary

With this patch LLVM starts scheduling instructions based on their latency.

For top-down scheduling, prefer scheduling instructions at a lower depth:
those instructions will be ready to execute before others at a higher depth.
For bottom-up scheduling, prefer scheduling instructions at a lower height.
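In sketch form, the preference described above could look like this (the diff itself is not reproduced in this archive; tryLess and the CandReason values are the existing helpers from MachineScheduler.cpp, and the exact placement inside GenericScheduler::tryCandidate is an assumption):

// Sketch: prefer the candidate whose operands are ready earlier.
if (Zone->isTop()) {
  // Top-down: lower depth means the instruction can issue sooner.
  if (tryLess(TryCand.SU->getDepth(), Cand.SU->getDepth(),
              TryCand, Cand, GenericSchedulerBase::TopDepthReduce))
    return;
} else {
  // Bottom-up: lower height is the mirror-image criterion.
  if (tryLess(TryCand.SU->getHeight(), Cand.SU->getHeight(),
              TryCand, Cand, GenericSchedulerBase::BotHeightReduce))
    return;
}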

For the following testcase:

void fun(char *restrict in, char *restrict out) {
  *in++ = *out++;
  *in++ = *out++;
  *in++ = *out++;
  *in++ = *out++;
}

on aarch64 we used to produce this code:

	ldrb	w8, [x1]
	strb	w8, [x0]
	ldrb	w8, [x1, #1]
	strb	w8, [x0, #1]
	ldrb	w8, [x1, #2]
	strb	w8, [x0, #2]
	ldrb	w8, [x1, #3]
	strb	w8, [x0, #3]

with the patch we now produce:

	ldrb	w8, [x1]
	ldrb	w9, [x1, #1]
	ldrb	w10, [x1, #2]
	ldrb	w11, [x1, #3]
	strb	w8, [x0]
	strb	w9, [x0, #1]
	strb	w10, [x0, #2]
	strb	w11, [x0, #3]

There are about 600 tests modified by this patch (mostly on the x86 side, and a few on aarch64).

Diff Detail

Event Timeline

sebpop created this revision. Apr 3 2018, 2:18 PM

I think we should ask the backend whether we want to gang up loads and stores, and if we do, how many we should gang up together. Ganging up lots of loads may result in high register pressure.

MatzeB added a comment. Apr 3 2018, 9:27 PM
This comment was removed by MatzeB.
MatzeB added a comment. Apr 3 2018, 9:31 PM

or

$ clang -arch arm64 -O3 -S -o- t.c -mcpu=generic
...
	ldrb	w8, [x1]
	ldrb	w9, [x1, #1]
	ldrb	w10, [x1, #2]
	ldrb	w11, [x1, #3]
	strb	w8, [x0]
	strb	w9, [x0, #1]
	strb	w10, [x0, #2]
	strb	w11, [x0, #3]
	ret
  • Are you sure you are using a good scheduling model?
  • A change like this needs a lot of benchmarking. In my experience, real-world performance loses ground all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency().)
This comment was removed by javed.absar.

Hi Sebastian:
For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e., scheduling the store just one cycle after the load, which would definitely result in stalls, especially on in-order processors). I am guessing the latency it is picking up is 1?

  • Are you sure you are using a good scheduling model?

I have seen badly scheduled code for -mcpu=cortex-a57 and the exynos-m* tunings.

Curiously, cortex-a53 has the best code, after inst-combine manages to fuse all the byte loads and stores.
Evandro will be investigating why this does not happen for the other -mcpu tunings.

In my experience, real-world performance loses ground all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency().)

Thanks for the pointer to tryLatency(): the good news is that the function
contains exactly the code I wrote in this patch.
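For reference, tryLatency in the MachineScheduler.cpp of this era looks roughly like this (reconstructed, not quoted verbatim from the tree):

static bool tryLatency(GenericSchedulerBase::SchedCandidate &TryCand,
                       GenericSchedulerBase::SchedCandidate &Cand,
                       SchedBoundary &Zone) {
  if (Zone.isTop()) {
    // Top-down: prefer lower depth once latency is already exposed,
    // otherwise prefer the longer remaining path (height).
    if (Cand.SU->getDepth() > Zone.getScheduledLatency()) {
      if (tryLess(TryCand.SU->getDepth(), Cand.SU->getDepth(),
                  TryCand, Cand, GenericSchedulerBase::TopDepthReduce))
        return true;
    }
    if (tryGreater(TryCand.SU->getHeight(), Cand.SU->getHeight(),
                   TryCand, Cand, GenericSchedulerBase::TopPathReduce))
      return true;
  } else {
    // Bottom-up: the symmetric checks on height and depth.
    if (Cand.SU->getHeight() > Zone.getScheduledLatency()) {
      if (tryLess(TryCand.SU->getHeight(), Cand.SU->getHeight(),
                  TryCand, Cand, GenericSchedulerBase::BotHeightReduce))
        return true;
    }
    if (tryGreater(TryCand.SU->getDepth(), Cand.SU->getDepth(),
                   TryCand, Cand, GenericSchedulerBase::BotPathReduce))
      return true;
  }
  return false;
}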
tryLatency is not executed because TryCand.Policy.ReduceLatency is false;
it is only set in the following code, so the problem may lie in this logic:

// Schedule aggressively for latency in PostRA mode. We don't check for
// acyclic latency during PostRA, and highly out-of-order processors will
// skip PostRA scheduling.
if (IsPostRA || (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)) {
  Policy.ReduceLatency |= true;
}

At the beginning of scheduling we are not on the critical path, so we do not set ReduceLatency.
We then schedule an instruction without looking at its latency, which introduces a bubble
that exceeds the critical path and consequently forces ReduceLatency to be set
for the remaining instructions to be scheduled.
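To make the failure mode concrete with made-up numbers (the real values depend on the machine model):

// Illustration only; assume Rem.CriticalPath = 5 cycles and a
// 4-cycle load-to-use latency.
//   First pick:  RemLatency = 5, CurrZone.getCurrCycle() = 0
//                -> 5 + 0 > 5 is false, so ReduceLatency stays unset.
//   The picker falls back to source order and places a store right
//   after its load, creating a multi-cycle bubble.
//   Later picks: RemLatency = 5, CurrZone.getCurrCycle() = 4
//                -> 5 + 4 > 5 is true, so ReduceLatency is finally set,
//                   but only after the bubble already exists.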

Hi Sebastian:
For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e., scheduling the store just one cycle after the load, which would definitely result in stalls, especially on in-order processors).

See the comment that I added inline to the place that falls through to the original scheduling order.

llvm/lib/CodeGen/MachineScheduler.cpp:3018

If nothing above returns early, this will pick the instruction with the smallest SU number in top-down scheduling and the largest SU number in bottom-up scheduling.
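The fall-through in question, at the end of GenericScheduler::tryCandidate, is roughly this (reconstructed from the code of this era):

// Fall through to original instruction order.
if ((Zone->isTop() && TryCand.SU->NodeNum < Cand.SU->NodeNum)
    || (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
  TryCand.Reason = NodeOrder;
}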

In the testcase compiled with -mcpu=cortex-a57, the load at SU(8) will be picked by the bottom-up scheduler before the other roots at SU(3), SU(5), and SU(7) for this "ORDER" reason.

*** Final schedule for %bb.0 ***
SU(0):   %1:gpr64common = COPY $x1
SU(1):   %0:gpr64common = COPY $x0
SU(2):   %2:gpr32 = LDRBBui %1:gpr64common, 0 :: (load 1 from %ir.out, !tbaa !2)
SU(4):   %3:gpr32 = LDRBBui %1:gpr64common, 1 :: (load 1 from %ir.incdec.ptr, !tbaa !2)
SU(6):   %4:gpr32 = LDRBBui %1:gpr64common, 2 :: (load 1 from %ir.incdec.ptr2, !tbaa !2)
SU(3):   STRBBui %2:gpr32, %0:gpr64common, 0 :: (store 1 into %ir.in, !tbaa !2)
SU(5):   STRBBui %3:gpr32, %0:gpr64common, 1 :: (store 1 into %ir.incdec.ptr1, !tbaa !2)
SU(7):   STRBBui %4:gpr32, %0:gpr64common, 2 :: (store 1 into %ir.incdec.ptr3, !tbaa !2)
SU(8):   %5:gpr32 = LDRBBui %1:gpr64common, 3 :: (load 1 from %ir.incdec.ptr4, !tbaa !2)
SU(9):   STRBBui %5:gpr32, %0:gpr64common, 3 :: (store 1 into %ir.incdec.ptr5, !tbaa !2)

SU(8) gets into the ready queue only after the bottom-up scheduler has scheduled SU(9), which consumes the output of SU(8).
Scheduling SU(8) just before SU(9) is very bad, as there are no other instructions left to hide the latency of SU(8).
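Assuming, for illustration, a 4-cycle load-to-use latency, the tail of the schedule above behaves like this:

// cycle 0:    ldrb w8, [x1, #3]   ; SU(8)
// cycles 1-3: stall               ; nothing left to hide the load latency
// cycle 4:    strb w8, [x0, #3]   ; SU(9) must wait for w8
// Had SU(8) been scheduled with the other loads, the three remaining
// stores could have filled those stall cycles.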

fhahn added a subscriber: fhahn. Apr 4 2018, 11:52 AM

A while ago I put up a patch to make the scheduler slightly more aggressive about using latency, D38279. I did not really have time to follow up on it though.

sebpop abandoned this revision. Apr 4 2018, 1:27 PM

Thanks @fhahn for the pointer: I am closing this revision and will try to fix the problem with something similar to D38279.
I tried that code out, and it does not modify the current behaviour of the scheduler.

dmgreen added a subscriber: dmgreen. Apr 4 2018, 1:37 PM
fhahn added a subscriber: jonpa. Apr 12 2018, 10:02 AM

Great, I am looking forward to hearing what you find! @jonpa just landed r329884, which should make it easier to add target-specific MachineSchedulers with customized heuristics; basically all one has to do is implement a custom tryCandidate function. IMO it would be easier to experiment with heuristics if we had a separate scheduler with experimental heuristics for AArch64.
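A hypothetical example of the kind of strategy r329884 enables (the class name and body are illustrative, not code from the tree, and the exact tryCandidate signature may differ between LLVM versions):

// Illustrative AArch64 strategy overriding the candidate comparison
// hook made virtual by r329884.
class AArch64ExperimentalSchedStrategy : public GenericScheduler {
public:
  AArch64ExperimentalSchedStrategy(const MachineSchedContext *C)
      : GenericScheduler(C) {}

  void tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
                    SchedBoundary *Zone) const override {
    // Try experimental AArch64 latency heuristics first...
    // ...then defer to the generic comparisons.
    GenericScheduler::tryCandidate(Cand, TryCand, Zone);
  }
};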