With this patch LLVM starts scheduling instructions based on their latency.
For top-down scheduling, prefer scheduling instructions at a lower depth:
those instructions will be ready to execute before others at a higher depth.
For bottom-up scheduling, prefer scheduling instructions at a lower height.
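The preference described above can be sketched as a small comparison. This is only an illustration under stated assumptions: `Node` and `preferByLatency` are hypothetical names, not LLVM's `SUnit` or any scheduler API, and depth and height are taken to be the latency-weighted longest path from the top and to the bottom of the DAG respectively.

```cpp
// Hypothetical, simplified stand-in for a scheduling unit (not LLVM's SUnit).
struct Node {
  unsigned num;    // position in the original instruction order
  unsigned depth;  // assumed: latency-weighted longest path from the DAG top
  unsigned height; // assumed: latency-weighted longest path to the DAG bottom
};

// Returns true if `a` is strictly preferred over `b` on latency alone.
// Top-down compares depth, bottom-up compares height; lower wins.
// A tie returns false in both directions, leaving the decision to a later rule.
inline bool preferByLatency(const Node &a, const Node &b, bool topDown) {
  unsigned la = topDown ? a.depth : a.height;
  unsigned lb = topDown ? b.depth : b.height;
  return la < lb;
}
```

In this sketch a load near the top of the DAG (low depth, large height) is preferred early by a top-down pass, and a bottom-up pass defers it, which places it earlier in the final order.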
For the following testcase:
  void fun(char *restrict in, char *restrict out) {
    *in++ = *out++;
    *in++ = *out++;
    *in++ = *out++;
    *in++ = *out++;
  }
on aarch64 we used to produce this code:
  ldrb w8, [x1]
  strb w8, [x0]
  ldrb w8, [x1, #1]
  strb w8, [x0, #1]
  ldrb w8, [x1, #2]
  strb w8, [x0, #2]
  ldrb w8, [x1, #3]
  strb w8, [x0, #3]
with the patch we now produce:
  ldrb w8, [x1]
  ldrb w9, [x1, #1]
  ldrb w10, [x1, #2]
  ldrb w11, [x1, #3]
  strb w8, [x0]
  strb w9, [x0, #1]
  strb w10, [x0, #2]
  strb w11, [x0, #3]
This patch modifies about 600 tests, mostly on the x86 side and a few on aarch64.
If none of the checks above returns early, this picks the instruction with the smallest SU number in top-down scheduling, and the largest SU number in bottom-up scheduling.
In the testcase compiled with -mcpu=cortex-a57, the bottom-up scheduler picks the load at SU(8) before the other roots at SU(3), SU(5), and SU(7) for this "ORDER" reason.
SU(8) enters the ready queue after the bottom-up scheduler schedules SU(9), which consumes the output of SU(8).
Scheduling SU(8) just before SU(9) is very bad, as there are no other instructions between them to hide the latency of SU(8).
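To make the "ORDER" fallback concrete, here is a minimal sketch of that rule in isolation. `pickByOrder` is a hypothetical helper, not an LLVM function, and the ready queue is reduced to a plain list of SU numbers; it only models the final tie-break, not the checks that run before it.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not an LLVM API): choose among ready SU numbers by
// original order only, as the fallback does when no earlier check decides.
// Precondition: `ready` is non-empty.
inline unsigned pickByOrder(const std::vector<unsigned> &ready, bool topDown) {
  // Top-down takes the smallest SU number, bottom-up takes the largest.
  return topDown ? *std::min_element(ready.begin(), ready.end())
                 : *std::max_element(ready.begin(), ready.end());
}
```

With the ready SU numbers 3, 5, 7 and 8 from the case above, the bottom-up call returns 8, so the load lands directly before its consumer SU(9) in the final order. A latency-aware check that runs before this fallback is what avoids that choice.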